Objectives

With growing population, urbanization and industrialization, particulate pollution is increasingly becoming a biggest concern.As the particulate pollution increases there will be higher level of unhealthy air. Particulate matters are inhalable by humans and can induce various cardiovascular diseases.As various researches indicate other major pollutants and weather could be a precursor for PM2.5 pollution, this data analysis with given air quality data and weather aims at understanding the 2.5 microns particulate matter relation to other major pollutants and their concentration during different seasons in India for major cities. Predicting PM2.5 can help in taking control actions interms of better solid waster management, industrial wastes, increasing the green cover etc. The study will undertake following objectives

  1. Analysis of given key pollutants dataset to check if it confirms to Benford’s law for completeness and quality
  2. Analysis and comparison of 6 year (2015-2020) Air quality Index for 4 major cities for different seasons
  3. Analysis of different pollutants for major cities over different years
  4. Analysis of different pollutants across different Indian seasons over the day during different hours
  5. Understanding the Impact of different pollutants, temperature, season, precipitation on PM2.5
  6. Building different prediction model for PM2.5 using key pollutants, weather info and season for major cities using lm and rpart modeling
  7. Evaluation and comparison of performance of different models and selecting a optimal one
knit_print.data.frame = function(x, ...) {
  res = rmarkdown::paged_table(x)
  rmarkdown:::knit_print.data.frame(res)
}

registerS3method(
  "knit_print", "data.frame", knit_print.data.frame,
  envir = asNamespace("knitr")
)

Preliminary analysis

The preliminary analysis involves loading, inspection and cleaning of the Air quality and weather datasets. As part of this activity, primary inspection conducted on the data set to understand the different variables, types and missing information based on visual and numerical analysis. Data cleaning is performed to remove unusable data entries/rows with incomplete cases or missing data.Further various operations performed to create clean, reduced or combined data set for further analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(stringr)
library(lubridate)
library(leaflet)
library(lazyeval)
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     IQR, binom.test, cor, cor.test, cov, fivenum, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(mosaicData)
install.packages("Datasets/statisticalModeling_0.3.0.tar.gz", repos = NULL, type = "source")
## Installing package into '/home/2023MCS120010/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages("Datasets/rpart.plot_3.1.1.tar.gz", repos = NULL, type = "source")
## Installing package into '/home/2023MCS120010/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(statisticalModeling)
## 
## Attaching package: 'statisticalModeling'
## The following objects are masked from 'package:ggformula':
## 
##     gf_abline, gf_bar, gf_boxplot, gf_counts, gf_density,
##     gf_density_2d, gf_frame, gf_freqpoly, gf_hex, gf_histogram,
##     gf_hline, gf_jitter, gf_line, gf_path, gf_point, gf_text
library(rpart)
library(rpart.plot)
city_day<-read.csv("./Datasets/Air_Quality/city_day.csv")
city_hour<-read.csv("./Datasets/Air_Quality/city_hour.csv")
station_day<-read.csv("./Datasets/Air_Quality/station_day.csv")
station_hour<-read.csv("./Datasets/Air_Quality/station_hour.csv")
stations<-read.csv("./Datasets/Air_Quality/stations.csv")
blr_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Bangalore_1990_2022_BangaloreCity.csv")
chn_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Chennai_1990_2022_Madras.csv")
dlhi_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Delhi_NCR_1990_2022_Safdarjung.csv")
lucknw_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Lucknow_1990_2022.csv")
mum_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Mumbai_1990_2022_Santacruz.csv")
geo_data<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv")
bhu_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/weather_Bhubhneshwar_1990_2022.csv")
rourkela_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/weather_Rourkela_2021_2022.csv")
rajas_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Rajasthan_1990_2022_Jodhpur.csv")

Exploratory Analysis of datasets

#city_day
head(city_day)
summary(city_day)
##      City               Date               PM2.5             PM10        
##  Length:29531       Length:29531       Min.   :  0.04   Min.   :   0.01  
##  Class :character   Class :character   1st Qu.: 28.82   1st Qu.:  56.26  
##  Mode  :character   Mode  :character   Median : 48.57   Median :  95.68  
##                                        Mean   : 67.45   Mean   : 118.13  
##                                        3rd Qu.: 80.59   3rd Qu.: 149.75  
##                                        Max.   :949.99   Max.   :1000.00  
##                                        NA's   :4598     NA's   :11140    
##        NO              NO2              NOx              NH3        
##  Min.   :  0.02   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  5.63   1st Qu.: 11.75   1st Qu.: 12.82   1st Qu.:  8.58  
##  Median :  9.89   Median : 21.69   Median : 23.52   Median : 15.85  
##  Mean   : 17.57   Mean   : 28.56   Mean   : 32.31   Mean   : 23.48  
##  3rd Qu.: 19.95   3rd Qu.: 37.62   3rd Qu.: 40.13   3rd Qu.: 30.02  
##  Max.   :390.68   Max.   :362.21   Max.   :467.63   Max.   :352.89  
##  NA's   :3582     NA's   :3585     NA's   :4185     NA's   :10328   
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.510   1st Qu.:  5.67   1st Qu.: 18.86   1st Qu.:  0.120  
##  Median :  0.890   Median :  9.16   Median : 30.84   Median :  1.070  
##  Mean   :  2.249   Mean   : 14.53   Mean   : 34.49   Mean   :  3.281  
##  3rd Qu.:  1.450   3rd Qu.: 15.22   3rd Qu.: 45.57   3rd Qu.:  3.080  
##  Max.   :175.810   Max.   :193.86   Max.   :257.73   Max.   :455.030  
##  NA's   :2059      NA's   :3854     NA's   :4022     NA's   :5623     
##     Toluene            Xylene            AQI          AQI_Bucket       
##  Min.   :  0.000   Min.   :  0.00   Min.   :  13.0   Length:29531      
##  1st Qu.:  0.600   1st Qu.:  0.14   1st Qu.:  81.0   Class :character  
##  Median :  2.970   Median :  0.98   Median : 118.0   Mode  :character  
##  Mean   :  8.701   Mean   :  3.07   Mean   : 166.5                     
##  3rd Qu.:  9.150   3rd Qu.:  3.35   3rd Qu.: 208.0                     
##  Max.   :454.850   Max.   :170.37   Max.   :2049.0                     
##  NA's   :8041      NA's   :18109    NA's   :4681
#glimpse(city_day)
#str(city_day)
nrow(city_day)
## [1] 29531
ncol(city_day)
## [1] 16
city_day <- replace(city_day, city_day=="", NA)
cd_na_total_per <-data.frame(colname = names(city_day),Total_NAs = colSums(is.na(city_day)), "PerctOfNAs" = colSums((is.na(city_day)/sum(is.na(city_day)) * 100)))
cd_na_total_per
ggplot(cd_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col() 

maxna <-max(rowSums(is.na(city_day)))
cd_rowswithmaxna<-subset(city_day, rowSums(is.na(city_day)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
city_day<-subset(city_day, rowSums(is.na(city_day)) != maxna)
#city_day
city_day[58,]
sum(is.na(city_day[58,]))
## [1] 2
cd_rowswithonlyna<-sum(rowSums(is.na(city_day)) == 14)
cd_rowswithonlyna
## [1] 0
head(city_day)
city_day <- replace(city_day, city_day=="", NA)%>%drop_na(AQI_Bucket)
city_day$AQI_Bucket <- as.factor(city_day$AQI_Bucket)
city_day$Date <- as.Date(city_day$Date)
#str(city_day)
summary(city_day)
##      City                Date                PM2.5             PM10       
##  Length:24850       Min.   :2015-01-01   Min.   :  0.04   Min.   :  0.03  
##  Class :character   1st Qu.:2017-08-16   1st Qu.: 29.00   1st Qu.: 56.78  
##  Mode  :character   Median :2018-11-05   Median : 48.78   Median : 96.18  
##                     Mean   :2018-07-24   Mean   : 67.48   Mean   :118.45  
##                     3rd Qu.:2019-10-11   3rd Qu.: 80.92   3rd Qu.:150.18  
##                     Max.   :2020-07-01   Max.   :914.94   Max.   :917.08  
##                                          NA's   :678      NA's   :7086    
##        NO              NO2              NOx              NH3        
##  Min.   :  0.03   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  5.66   1st Qu.: 11.94   1st Qu.: 13.11   1st Qu.:  8.96  
##  Median :  9.91   Median : 22.10   Median : 23.68   Median : 16.31  
##  Mean   : 17.62   Mean   : 28.98   Mean   : 32.29   Mean   : 23.85  
##  3rd Qu.: 20.03   3rd Qu.: 38.24   3rd Qu.: 40.17   3rd Qu.: 30.36  
##  Max.   :390.68   Max.   :362.21   Max.   :378.24   Max.   :352.89  
##  NA's   :387      NA's   :391      NA's   :1857     NA's   :6536    
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.590   1st Qu.:  5.73   1st Qu.: 19.25   1st Qu.:  0.230  
##  Median :  0.930   Median :  9.22   Median : 31.25   Median :  1.290  
##  Mean   :  2.345   Mean   : 14.36   Mean   : 34.91   Mean   :  3.459  
##  3rd Qu.:  1.480   3rd Qu.: 15.14   3rd Qu.: 46.08   3rd Qu.:  3.340  
##  Max.   :175.810   Max.   :186.08   Max.   :257.73   Max.   :455.030  
##  NA's   :445       NA's   :605      NA's   :807      NA's   :3535     
##     Toluene            Xylene             AQI                AQI_Bucket  
##  Min.   :  0.000   Min.   :  0.000   Min.   :  13.0   Good        :1341  
##  1st Qu.:  1.028   1st Qu.:  0.390   1st Qu.:  81.0   Moderate    :8829  
##  Median :  3.575   Median :  1.420   Median : 118.0   Poor        :2781  
##  Mean   :  9.526   Mean   :  3.589   Mean   : 166.5   Satisfactory:8224  
##  3rd Qu.: 10.180   3rd Qu.:  4.120   3rd Qu.: 208.0   Severe      :1338  
##  Max.   :454.850   Max.   :170.370   Max.   :2049.0   Very Poor   :2337  
##  NA's   :5826      NA's   :15372
dropcol<-c("PM10", "Toluene", "Benzene", "Xylene")#, "NH3")
#city_day<-city_day[,dropcol,drop=FALSE]
city_day<-subset(city_day, select=-c(PM10, Toluene, Benzene, Xylene))#, NH3))
head(city_day)

With the initial glimpse on the city air quality data the following characteristics and anomalies were observed. The dataset contains different Indian cities Air pollutant information, air quality index and air quality bucket or index from year 2015 to 2020. The summary indicates there are incomplete or missing variable values through out the dataset and across different columns.The rows with maximum number of NAs has only the date and city name, those rows may not be useful and can be removed. AQI_Bucket has categorical variable and has missing values in various rows and that is not filled with NA and is of type character.The missing information is in different columns but it has useful measurements or values in other rows.This requires further analysis if those rows can be dropped or retained. There are rows with AQI and AQI_Bucket but missing values in certain columns, those rows can be retained considering there could be correlated information but the rows without AQI and AQI_Bucket can be considered for dropping.The missing values or null strings can be filled with NA and AQI_Bucket type can be converted to factor. Dropping columns or variables higher than 10% and the variables not required for analysis. Based on this dropping columns PM10, Benzene, Toluene and Xylene.

The analysis will only focus on major cities ie Bengaluru, Chennai, Delhi and Mumbai. The following dataset processing will focus to extract data only for the interested cities. After filtering the Dataset for specific cities, unique cities are checked on the resultant dataset to ensue if it contains only the required.

major_city_day<-city_day%>%filter(City %in% c("Chennai",  "Bengaluru", "Delhi","Mumbai"))
summary(major_city_day)
##      City                Date                PM2.5              NO        
##  Length:6568        Min.   :2015-01-01   Min.   :  1.72   Min.   :  0.46  
##  Class :character   1st Qu.:2016-09-05   1st Qu.: 28.24   1st Qu.:  6.87  
##  Mode  :character   Median :2018-03-17   Median : 46.09   Median : 11.42  
##                     Mean   :2018-01-08   Mean   : 65.01   Mean   : 20.60  
##                     3rd Qu.:2019-05-18   3rd Qu.: 75.44   3rd Qu.: 23.89  
##                     Max.   :2020-07-01   Max.   :685.36   Max.   :221.03  
##                                          NA's   :65       NA's   :39      
##       NO2              NOx              NH3               CO        
##  Min.   :  2.49   Min.   :  0.00   Min.   :  0.02   Min.   : 0.000  
##  1st Qu.: 15.90   1st Qu.: 15.28   1st Qu.: 19.28   1st Qu.: 0.720  
##  Median : 26.04   Median : 25.50   Median : 31.11   Median : 0.990  
##  Mean   : 31.47   Mean   : 35.21   Mean   : 40.05   Mean   : 1.526  
##  3rd Qu.: 41.03   3rd Qu.: 46.20   3rd Qu.: 46.22   3rd Qu.: 1.430  
##  Max.   :162.50   Max.   :254.80   Max.   :352.89   Max.   :48.070  
##  NA's   :39       NA's   :17       NA's   :1005     NA's   :18      
##       SO2              O3              AQI               AQI_Bucket  
##  Min.   : 0.73   Min.   :  1.80   Min.   : 20.0   Good        : 174  
##  1st Qu.: 5.07   1st Qu.: 23.31   1st Qu.: 79.0   Moderate    :2238  
##  Median : 7.89   Median : 34.70   Median :111.0   Poor        : 723  
##  Mean   :10.24   Mean   : 38.29   Mean   :151.7   Satisfactory:2651  
##  3rd Qu.:13.51   3rd Qu.: 48.73   3rd Qu.:188.0   Severe      : 245  
##  Max.   :71.56   Max.   :257.73   Max.   :716.0   Very Poor   : 537  
##  NA's   :112     NA's   :224
unique(city_day$City)
##  [1] "Ahmedabad"          "Aizawl"             "Amaravati"         
##  [4] "Amritsar"           "Bengaluru"          "Bhopal"            
##  [7] "Brajrajnagar"       "Chandigarh"         "Chennai"           
## [10] "Coimbatore"         "Delhi"              "Ernakulam"         
## [13] "Gurugram"           "Guwahati"           "Hyderabad"         
## [16] "Jaipur"             "Jorapokhar"         "Kochi"             
## [19] "Kolkata"            "Lucknow"            "Mumbai"            
## [22] "Patna"              "Shillong"           "Talcher"           
## [25] "Thiruvananthapuram" "Visakhapatnam"
unique(major_city_day$City)
## [1] "Bengaluru" "Chennai"   "Delhi"     "Mumbai"

Benfordness Check of the data set to see if the dataset is complete and usable for analysis

Benfordness is checked without using Benford.analyis package instead used only basic functions from packages covered under the course. In contrary to the belief that the leading digit of a numerical data set 1 through 9 has equal probability, where as Benford’s law states that leading digits are distributed in a specific,and nonuniform way.It follows a lograthmic pattern with P(d)=log10(1+1/d), d=1,2..9. This holds true for all naturally or randomly occuring numbers of large dataset. As the airquality data is something which occurs naturally and it random. Benfordness should be holding true for the dataset if it is complete and there are no manipulation.

firstDigit<-function(x){
 str_extract(x, "[123456789]")
  #as.numeric(x)
  #x=x^2
}
BenfordConfirmity<-function(x, y){
  x=floor(x) + signif(x, 2)
  if(0.000 <= x && x < 0.006)
      print(paste(y, "Dataset has Close confirmity with Mean abosolute Deviation", x))
  else if(0.006 <= x && x < 0.012)
    print(paste(y , "Dataset has Acceptable confirmity with Mean abosolute Deviation", x))
  else if(0.012 <= x && x <= 0.015)
     print(paste(y, "Dataset has Marginally Acceptable confirmity with Mean abosolute Deviation", x))
  else if(x > 0.015)
    print(paste(y, "Dataset has no confirmity with Mean abosolute Deviation", x))
}
major_city_day_first_digit <-city_day %>% mutate(across(seq(3:13), firstDigit))%>%subset(select=-c(1,2,12))

PM2.5FD_Counts<-as.vector(table(major_city_day_first_digit$PM2.5))
PM2.5first_digit_actual_vs_expected <- data.frame(
digit            = 1:9,
actual.count     = PM2.5FD_Counts,    
actual.fraction  = PM2.5FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
PM2.5first_digit_actual_vs_expected<-PM2.5first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
PM2.5first_digit_actual_vs_expected
BenfordConfirmity(mean(PM2.5first_digit_actual_vs_expected$difference.fraction), "PM2.5")
## [1] "PM2.5 Dataset has no confirmity with Mean abosolute Deviation 0.017"
NOFD_Counts<-as.vector(table(major_city_day_first_digit$NO))
NOfirst_digit_actual_vs_expected <- data.frame(
digit            = 1:9,
actual.count     = NOFD_Counts,    
actual.fraction  = NOFD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOfirst_digit_actual_vs_expected<-NOfirst_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOfirst_digit_actual_vs_expected
BenfordConfirmity(mean(NOfirst_digit_actual_vs_expected$difference.fraction), "NO")
## [1] "NO Dataset has no confirmity with Mean abosolute Deviation 0.016"
O3FD_Counts<-as.vector(table(major_city_day_first_digit$O3))
O3first_digit_actual_vs_expected <- data.frame(
digit            = 1:9,
actual.count     = O3FD_Counts,    
actual.fraction  = O3FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
O3first_digit_actual_vs_expected<-O3first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
O3first_digit_actual_vs_expected
BenfordConfirmity(mean(O3first_digit_actual_vs_expected$difference.fraction), "O3")
## [1] "O3 Dataset has no confirmity with Mean abosolute Deviation 0.036"
SO2FD_Counts<-as.vector(table(major_city_day_first_digit$SO2))
SO2first_digit_actual_vs_expected <- data.frame(
digit            = 1:9,
actual.count     = SO2FD_Counts,    
actual.fraction  = SO2FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
SO2first_digit_actual_vs_expected<-SO2first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
SO2first_digit_actual_vs_expected
BenfordConfirmity(mean(SO2first_digit_actual_vs_expected$difference.fraction), "SO2")
## [1] "SO2 Dataset has no confirmity with Mean abosolute Deviation 0.023"
ggplot(PM2.5first_digit_actual_vs_expected, aes(x=digit, y=benford.fraction))+ geom_col()+ geom_line(aes(y=actual.fraction, color='red')) + scale_x_continuous(name ="digit", breaks = seq(1, 9, 1)) + labs(title = 'Benfordness of PM2.5 in City_day')

The benfordness check was done on selected pollutants like PM2.5, NO, O3 and SO2. More it was done as random sampling check. As per the check all four data doesnt have enough representation to confirms to Benford’s law. “PM2.5 Dataset has no confirmity with Mean abosolute Deviation 0.017” “NO Dataset has no confirmity with Mean abosolute Deviation 0.016” “O3 Dataset has no confirmity with Mean abosolute Deviation 0.036” “SO2 Dataset has no confirmity with Mean abosolute Deviation 0.023”
The possible reasons could be lot of missing data, as the source is unknown potentially could have been manipulated or errors in the recording. This might impact the final model which may to require to revisit the data.

Analysis of AQI across major cities over different months for the period of 2015-2020

city_day_monthwise<-major_city_day%>%group_by(City, year=year(Date), month = month(Date))%>%summarise(across(c(seq(3:13)), mean, na.rm=TRUE))
## Warning: There were 225 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(3:13)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `year = 2015`, `month = 3`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 224 remaining warnings.
## `summarise()` has grouped output by 'City', 'year'. You can override using the
## `.groups` argument.
city_day_monthwise
## # A tibble: 224 Ă— 14
## # Groups:   City, year [21]
##    City    year month Date       PM2.5    NO   NO2   NOx   NH3    CO   SO2    O3
##    <chr>  <dbl> <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Benga…  2015     3 2015-03-26  48.1  3.34 22.3  15.8   28.3  5.53  3.22 60.1 
##  2 Benga…  2015     4 2015-04-15  28.7  3.89  8.88 10.6   23.6  3.83  8.31 63.8 
##  3 Benga…  2015     5 2015-05-16  33.5  5.84 22.3  13.7   11.9  5.49  5.27 29.4 
##  4 Benga…  2015     6 2015-06-15  18.3  2.72 13.5   9.17  11.5  3.21  6.61 21.0 
##  5 Benga…  2015     7 2015-07-16  21.0  4.12 10.1   7.86  12.3  9.55  4.04  7.53
##  6 Benga…  2015     8 2015-08-16  21.1  4.24 13.9   9.79  24.7 14.1   3.02  6.19
##  7 Benga…  2015     9 2015-09-15  20.2  4.94 15.8  12.2   33.2  5.97  2.52 15.7 
##  8 Benga…  2015    10 2015-10-16  31.4  5.29 28.1  20.8   31.5  2.28  3.13 17.9 
##  9 Benga…  2015    11 2015-11-15  30.8  5.48 17.6  17.4   27.5  1.61  2.80 19.9 
## 10 Benga…  2015    12 2015-12-16  46.0 16.2  43.5  32.6   30.3  1.64  5.65 21.5 
## # ℹ 214 more rows
## # ℹ 2 more variables: AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_day_monthwise, aes(x=month, y=AQI, color=factor(year))) + geom_line() + facet_wrap(~City) + scale_x_continuous(name ="month", breaks = seq(1, 12, 1))

AQI over different months for major cities are highly varying. Delhi has very high AQI over 200 different months and even goes above 400 compared to other cities. The AQI is high during start of the year and really high during end of the year. Across different cities the common observation over the give years the AQI has improved for the respective city AQI benchmarks and AQI vary across different months. This clearly indicates that seasonal changes and weather has impact on the AQI. Mumbai appears to have missing data.

city_day_yearwise<-major_city_day%>%group_by(City, year=year(Date))%>%summarise(across(c(seq(3:13)), mean, na.rm=TRUE))%>%filter(City %in% c("Ahmedabad", "Chennai", "Bengaluru", "Delhi","Mumbai"))
## Warning: There were 21 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(3:13)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `year = 2015`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 20 remaining warnings.
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
#city_day_yearwise
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=PM2.5, color="PM2.5", na.rm=TRUE)) + facet_wrap(~City, nrow=5)
## Warning in geom_line(aes(y = PM2.5, color = "PM2.5", na.rm = TRUE)): Ignoring
## unknown aesthetics: na.rm

#sggplot(city_day_yearwise, aes(x=year)) + geom_bar(aes(y=NH3, color=City, na.rm=TRUE)) + facet_wrap(~City, nrow=5)

Similar to AQI Delhi has high PM2.5 concentration and next is chennai. Mumbai has low PM2.5 low PM2.5 concentration and no data available for 2015-2017. PM2.5 data for Mumbai may be incorrect given urbanization and industrialization in Mumbai. Bangalore has very low PM2.5 concentration.

ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=NO, color="NO")) + geom_line(aes(y=NO2, color="NO2")) + geom_line(aes(y=NOx, color="NOx"))+ facet_wrap(~City, nrow=5)

Compared to Chennai, Bangalore has high NO, NO2 and NOx concentration.Mumbai also higher concentration than Bangalore and chennai. Delhi similar to AQI and PM2.5, has higher concentration on NO, NO2 and NOx

ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=SO2, color="SO2")) + facet_wrap(~City, nrow=5)

Bangalore has very low SO2 concentration. SO2 concentration is high in both Mumbai and Delhi.

ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=CO, color="CO")) + facet_wrap(~City, nrow=5)

CO concentraion has reduced over the years for all the cities.

ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=O3, color="O3")) + facet_wrap(~City, nrow=5)

O3 concentration is higher in Delhi. Chennai and Bangalore had more or similar concentration. Mumbai has lowest O3 concentration.

Analysis of AQI and different pollutants for across major cities for different seasons over the day during different hours using City hour dataset

#city_hour
head(city_hour)
glimpse(city_hour)
## Rows: 707,875
## Columns: 16
## $ City       <chr> "Ahmedabad", "Ahmedabad", "Ahmedabad", "Ahmedabad", "Ahmeda…
## $ Datetime   <chr> "2015-01-01 01:00:00", "2015-01-01 02:00:00", "2015-01-01 0…
## $ PM2.5      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PM10       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NO         <dbl> 1.00, 0.02, 0.08, 0.30, 0.12, 0.33, 0.45, 1.03, 1.47, 2.05,…
## $ NO2        <dbl> 40.01, 27.75, 19.32, 16.45, 14.90, 15.95, 15.94, 16.66, 16.…
## $ NOx        <dbl> 36.37, 19.73, 11.08, 9.20, 7.85, 10.82, 12.47, 16.48, 18.02…
## $ NH3        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CO         <dbl> 1.00, 0.02, 0.08, 0.30, 0.12, 0.33, 0.45, 1.03, 1.47, 2.05,…
## $ SO2        <dbl> 122.07, 85.90, 52.83, 39.53, 32.63, 29.87, 27.41, 20.92, 16…
## $ O3         <dbl> NA, NA, NA, 153.58, NA, 64.25, 191.96, 177.21, 122.08, NA, …
## $ Benzene    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Toluene    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ Xylene     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AQI        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ AQI_Bucket <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
city_hour <- replace(city_hour, city_hour=="", NA)
ch_na_total_per <-data.frame(colname = names(city_hour),Total_NAs = colSums(is.na(city_hour)), "PerctOfNAs" = colSums((is.na(city_hour)/sum(is.na(city_hour)) * 100)))
ch_na_total_per
ggplot(ch_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col() 

maxna <-max(rowSums(is.na(city_hour)))
cd_rowswithmaxna<-subset(city_hour, rowSums(is.na(city_hour)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
city_hour<-subset(city_hour, rowSums(is.na(city_hour)) != maxna)
#city_hour
head(city_hour)
city_hour <- replace(city_hour, city_hour=="", NA)%>% drop_na(AQI_Bucket)
city_hour$AQI_Bucket <- as.factor(city_hour$AQI_Bucket)
#str(city_hour)
summary(city_hour)
##      City             Datetime             PM2.5              PM10        
##  Length:578795      Length:578795      Min.   :   0.01   Min.   :   0.01  
##  Class :character   Class :character   1st Qu.:  26.27   1st Qu.:  52.78  
##  Mode  :character   Mode  :character   Median :  46.50   Median :  92.00  
##                                        Mean   :  67.51   Mean   : 119.60  
##                                        3rd Qu.:  79.48   3rd Qu.: 148.14  
##                                        Max.   : 999.99   Max.   :1000.00  
##                                        NA's   :31035     NA's   :178863   
##        NO              NO2              NOx              NH3        
##  Min.   :  0.01   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  3.87   1st Qu.: 10.96   1st Qu.: 10.97   1st Qu.:  8.50  
##  Median :  7.97   Median : 20.68   Median : 20.99   Median : 15.87  
##  Mean   : 17.50   Mean   : 29.27   Mean   : 32.34   Mean   : 23.98  
##  3rd Qu.: 16.14   3rd Qu.: 36.92   3rd Qu.: 37.46   3rd Qu.: 29.65  
##  Max.   :498.97   Max.   :499.51   Max.   :498.61   Max.   :499.97  
##  NA's   :21481    NA's   :22267    NA's   :51906    NA's   :163994  
##        CO               SO2               O3            Benzene      
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.00  
##  1st Qu.:  0.490   1st Qu.:  4.95   1st Qu.: 13.79   1st Qu.:  0.15  
##  Median :  0.840   Median :  8.46   Median : 26.64   Median :  1.09  
##  Mean   :  2.277   Mean   : 14.01   Mean   : 35.25   Mean   :  3.33  
##  3rd Qu.:  1.400   3rd Qu.: 14.77   3rd Qu.: 48.22   3rd Qu.:  3.03  
##  Max.   :498.570   Max.   :199.96   Max.   :497.62   Max.   :498.07  
##  NA's   :29146     NA's   :33058    NA's   :33372    NA's   :96365   
##     Toluene           Xylene            AQI                AQI_Bucket    
##  Min.   :  0.00   Min.   :  0.0    Min.   :   8.0   Good        : 38611  
##  1st Qu.:  0.75   1st Qu.:  0.3    1st Qu.:  79.0   Moderate    :198991  
##  Median :  3.14   Median :  1.2    Median : 116.0   Poor        : 66654  
##  Mean   :  9.50   Mean   :  3.7    Mean   : 166.4   Satisfactory:189434  
##  3rd Qu.:  9.56   3rd Qu.:  3.8    3rd Qu.: 208.0   Severe      : 27650  
##  Max.   :499.40   Max.   :500.0    Max.   :3133.0   Very Poor   : 57455  
##  NA's   :149001   NA's   :371700
head(city_hour)
tail(city_hour)

In city hour data set, Xylene, NH3, PM10 has greater 10% of NAs. Other pollutants has nearly 5% of NAs. Remove rows with all NAs and rows with AQI and AQI bucket with NAs. Further determining and assigning seasons to every entry in the dataset based on date.

season<-function(date){
    winter <- as.Date("2012-01-15", format = "%Y-%m-%d") # Winter 
    spring <- as.Date("2012-02-15",  format = "%Y-%m-%d") # Spring 
    summer <- as.Date("2012-03-15",  format = "%Y-%m-%d") # Summer 
    monsoon <- as.Date("2012-06-15",  format = "%Y-%m-%d") # Monsoon
    autumn <- as.Date("2012-10-15",  format = "%Y-%m-%d") # autumn 
    prewinter <- as.Date("2012-12-15",  format = "%Y-%m-%d") # prewinter 

    # Convert dates from any year to 2012 dates
    d <- as.Date(strftime(date, format="2012-%m-%d"))
    
    ifelse (d >= winter & d < spring, "Winter",
      ifelse (d >= spring & d < summer, "Spring",
        ifelse (d >= summer & d < monsoon, "Summer",
         ifelse (d >= monsoon & d < autumn, "Monsoon",
            ifelse (d >= autumn & d < prewinter, "Autumn", "Prewinter")))))
}

#season(as.Date("2015-11-20"))
#head(city_hour)
city_hour_sepdt<-city_hour %>% separate(Datetime, c( "Date", "Time"), " ")%>% filter(City %in% c("Chennai", "Bengaluru", "Delhi","Mumbai")) %>%mutate(Season=season(Date), Year = lubridate::year(Date))#%>%select(Year, Season, Time, PM2.5)
#head(city_hour_sepdt)

Change in AQI or pollutant concentration analysis is done for Chennai

city_hour_sepdt_byseason_ch<-city_hour_sepdt %>% filter(Year == 2020 & City == "Chennai") %>%  group_by(City, Year,Season, Time) %>%summarise(across(c(seq(4:18)), mean, na.rm=TRUE)) #%>% summarise(PM2.5_Mean= mean(PM2.5))
## Warning: There were 240 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(4:18)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Chennai"`, `Year = 2020`, `Season = "Monsoon"`, `Time =
##   "00:00:00"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 239 remaining warnings.
## `summarise()` has grouped output by 'City', 'Year', 'Season'. You can override
## using the `.groups` argument.
city_hour_sepdt_byseason_ch
## # A tibble: 120 Ă— 19
## # Groups:   City, Year, Season [5]
##    City     Year Season  Time     Date PM2.5  PM10    NO   NO2   NOx   NH3    CO
##    <chr>   <dbl> <chr>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Chennai  2020 Monsoon 00:00:…    NA  28.0  40.5  7.05  8.94  15.5  34.4  1.02
##  2 Chennai  2020 Monsoon 01:00:…    NA  23.2  41.8  8.36  8.97  16.9  34.4  1.03
##  3 Chennai  2020 Monsoon 02:00:…    NA  22.4  39.2  8.10  9.43  17.0  34.6  1.07
##  4 Chennai  2020 Monsoon 03:00:…    NA  22.8  40.2  7.83 10.0   17.2  34.5  1.09
##  5 Chennai  2020 Monsoon 04:00:…    NA  25.8  37.2  8.00 10.9   18.3  34.6  1.10
##  6 Chennai  2020 Monsoon 05:00:…    NA  34.8  31.3  8.20 11.7   19.5  35.5  1.09
##  7 Chennai  2020 Monsoon 06:00:…    NA  32.8  30.4  8.50 12.4   20.4  34.4  1.10
##  8 Chennai  2020 Monsoon 07:00:…    NA  27.1  32.6  9.18 13.4   22.2  34.2  1.15
##  9 Chennai  2020 Monsoon 08:00:…    NA  34.3  34.4  9.33 14.2   23.0  34.3  1.19
## 10 Chennai  2020 Monsoon 09:00:…    NA  34.9  31.4  8.84 12.4   20.8  34.5  1.18
## # ℹ 110 more rows
## # ℹ 7 more variables: SO2 <dbl>, O3 <dbl>, Benzene <dbl>, Toluene <dbl>,
## #   Xylene <dbl>, AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=PM2.5, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=O3, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=NO, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90))# + facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=CO, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=SO2, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=AQI, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

For Chennai 1. PM2.5 peaks from morning till afternoon 1PM and the concentration high during prewinter and winter season. During summer concentration is low. 2. O3 is very high during monsoon and high during summer. O3 peaks during the peak hours 9 am to 6 pm. 3. CO is very high during monsoon and high during winter. CO peaks first half of the day and drops during second of the day. 4. SO2 varies through out the day, high during winter and low during summer. 5. AQI is high during monsoon and low during summer and peaks during afternoon till midnight 6. NO is high during early morning and low during afternoons.

city_hour_sepdt_byseason_ah<-city_hour_sepdt %>% filter(Year == 2020 & City == "Bengaluru") %>%  group_by(City, Year,Season, Time) %>%summarise(across(c(seq(4:18)), mean, na.rm=TRUE)) #%>% summarise(PM2.5_Mean= mean(PM2.5))
## Warning: There were 240 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(4:18)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `Year = 2020`, `Season = "Monsoon"`, `Time
##   = "00:00:00"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 239 remaining warnings.
## `summarise()` has grouped output by 'City', 'Year', 'Season'. You can override
## using the `.groups` argument.
city_hour_sepdt_byseason_ah
## # A tibble: 120 Ă— 19
## # Groups:   City, Year, Season [5]
##    City       Year Season  Time   Date PM2.5  PM10    NO   NO2   NOx   NH3    CO
##    <chr>     <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Bengaluru  2020 Monsoon 00:0…    NA  19.5  38.3  3.93  12.2  13.6  6.84 0.574
##  2 Bengaluru  2020 Monsoon 01:0…    NA  16.1  32.9  3.84  11.0  12.8  6.65 0.482
##  3 Bengaluru  2020 Monsoon 02:0…    NA  15.2  31.3  3.78  10.5  12.6  6.52 0.495
##  4 Bengaluru  2020 Monsoon 03:0…    NA  15.0  31.0  3.71  10.2  12.4  6.43 0.471
##  5 Bengaluru  2020 Monsoon 04:0…    NA  15.2  32.4  3.72  10.1  12.2  6.58 0.492
##  6 Bengaluru  2020 Monsoon 05:0…    NA  13.9  33.9  3.81  10.0  12.3  6.67 0.611
##  7 Bengaluru  2020 Monsoon 06:0…    NA  14.0  37.9  4.03  10.7  12.9  7.05 0.527
##  8 Bengaluru  2020 Monsoon 07:0…    NA  14.8  33.8  4.65  12.1  14.4  6.77 0.554
##  9 Bengaluru  2020 Monsoon 08:0…    NA  17.5  38.5  5.38  13.1  15.6  6.72 0.576
## 10 Bengaluru  2020 Monsoon 09:0…    NA  20.5  39.4  5.76  13.7  16.5  6.81 0.648
## # ℹ 110 more rows
## # ℹ 7 more variables: SO2 <dbl>, O3 <dbl>, Benzene <dbl>, Toluene <dbl>,
## #   Xylene <dbl>, AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=PM2.5, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=O3, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=NO, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90))# + facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=CO, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=SO2, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=AQI, color=factor(Season))) + geom_point()  + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)

For Bengaluru 1. PM2.5 peaks from morning till afternoon 12PM and the concentration high during spring and winter season. During Monsoon concentraion is low. 2. O3 is very high during Spring and high during winter and prewinter. O3 peaks during the peak during afternoon and midnight. 3. CO is very high during monsoon and high during winter. CO peaks first half of the day and drops during second of the day. 4. AQI drops during early morning and peaks in the afternoon, high during spring and low during monsoon. 5. SO2 is high during winter and low during summer and peaks during afternoon till midnight 6. NO high during winter and prewinter, peaks during early morning and late night

##Creating combined Data set with Station day and Weather for major cities. The below code extract the station Ids for the major cities that are being analyzed. The City name, Season, Year, Latitude, longitude, elevation data are also added to the combined data set for modeling.

head(stations)
summary(stations)
##   StationId         StationName            City              State          
##  Length:230         Length:230         Length:230         Length:230        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     Status         
##  Length:230        
##  Class :character  
##  Mode  :character
stations$City
##   [1] "Amaravati"          "Rajamahendravaram"  "Tirupati"          
##   [4] "Vijayawada"         "Visakhapatnam"      "Guwahati"          
##   [7] "Gaya"               "Gaya"               "Hajipur"           
##  [10] "Muzaffarpur"        "Patna"              "Patna"             
##  [13] "Patna"              "Patna"              "Patna"             
##  [16] "Patna"              "Chandigarh"         "Delhi"             
##  [19] "Delhi"              "Delhi"              "Delhi"             
##  [22] "Delhi"              "Delhi"              "Delhi"             
##  [25] "Delhi"              "Delhi"              "Delhi"             
##  [28] "Delhi"              "Delhi"              "Delhi"             
##  [31] "Delhi"              "Delhi"              "Delhi"             
##  [34] "Delhi"              "Delhi"              "Delhi"             
##  [37] "Delhi"              "Delhi"              "Delhi"             
##  [40] "Delhi"              "Delhi"              "Delhi"             
##  [43] "Delhi"              "Delhi"              "Delhi"             
##  [46] "Delhi"              "Delhi"              "Delhi"             
##  [49] "Delhi"              "Delhi"              "Delhi"             
##  [52] "Delhi"              "Delhi"              "Delhi"             
##  [55] "Delhi"              "Ahmedabad"          "Ankleshwar"        
##  [58] "Gandhinagar"        "Nandesari"          "Vapi"              
##  [61] "Vatva"              "Ambala"             "Bahadurgarh"       
##  [64] "Ballabgarh"         "Bhiwani"            "Dharuhera"         
##  [67] "Faridabad"          "Faridabad"          "Faridabad"         
##  [70] "Faridabad"          "Fatehabad"          "Gurugram"          
##  [73] "Gurugram"           "Gurugram"           "Gurugram"          
##  [76] "Hisar"              "Jind"               "Kaithal"           
##  [79] "Karnal"             "Kurukshetra"        "Mandikhera"        
##  [82] "Manesar"            "Narnaul"            "Palwal"            
##  [85] "Panchkula"          "Panipat"            "Rohtak"            
##  [88] "Sirsa"              "Sonipat"            "Yamuna Nagar"      
##  [91] "Jorapokhar"         "Bagalkot"           "Bengaluru"         
##  [94] "Bengaluru"          "Bengaluru"          "Bengaluru"         
##  [97] "Bengaluru"          "Bengaluru"          "Bengaluru"         
## [100] "Bengaluru"          "Bengaluru"          "Bengaluru"         
## [103] "Chamarajanagar"     "Chikkaballapur"     "Chikkamagaluru"    
## [106] "Hubballi"           "Kalaburagi"         "Mysuru"            
## [109] "Ramanagara"         "Vijayapura"         "Yadgir"            
## [112] "Eloor"              "Ernakulam"          "Kannur"            
## [115] "Kochi"              "Kollam"             "Kozhikode"         
## [118] "Thiruvananthapuram" "Thiruvananthapuram" "Bhopal"            
## [121] "Damoh"              "Dewas"              "Gwalior"           
## [124] "Gwalior"            "Indore"             "Jabalpur"          
## [127] "Katni"              "Maihar"             "Mandideep"         
## [130] "Pithampur"          "Ratlam"             "Sagar"             
## [133] "Satna"              "Singrauli"          "Ujjain"            
## [136] "Aurangabad"         "Chandrapur"         "Chandrapur"        
## [139] "Kalyan"             "Mumbai"             "Mumbai"            
## [142] "Mumbai"             "Mumbai"             "Mumbai"            
## [145] "Mumbai"             "Mumbai"             "Mumbai"            
## [148] "Mumbai"             "Mumbai"             "Nagpur"            
## [151] "Nashik"             "Navi Mumbai"        "Navi Mumbai"       
## [154] "Navi Mumbai"        "Pune"               "Solapur"           
## [157] "Thane"              "Shillong"           "Aizawl"            
## [160] "Brajrajnagar"       "Talcher"            "Amritsar"          
## [163] "Bathinda"           "Jalandhar"          "Khanna"            
## [166] "Ludhiana"           "Gobindgarh"         "Patiala"           
## [169] "Rupnagar"           "Alwar"              "Ajmer"             
## [172] "Bhiwandi"           "Jaipur"             "Jaipur"            
## [175] "Jaipur"             "Jodhpur"            "Kota"              
## [178] "Pali"               "Udaipur"            "Chennai"           
## [181] "Chennai"            "Chennai"            "Chennai"           
## [184] "Coimbatore"         "Hyderabad"          "Hyderabad"         
## [187] "Hyderabad"          "Hyderabad"          "Hyderabad"         
## [190] "Hyderabad"          "Agra"               "Baghpat"           
## [193] "Bulandshahr"        "Ghaziabad"          "Ghaziabad"         
## [196] "Ghaziabad"          "Ghaziabad"          "Greater Noida"     
## [199] "Greater Noida"      "Hapur"              "Kanpur"            
## [202] "Lucknow"            "Lucknow"            "Lucknow"           
## [205] "Lucknow"            "Lucknow"            "Meerut"            
## [208] "Meerut"             "Meerut"             "Moradabad"         
## [211] "Muzzaffarnagar"     "Noida"              "Noida"             
## [214] "Noida"              "Noida"              "Varanasi"          
## [217] "Asansol"            "Durgapur"           "Haldia"            
## [220] "Howrah"             "Howrah"             "Howrah"            
## [223] "Kolkata"            "Kolkata"            "Kolkata"           
## [226] "Kolkata"            "Kolkata"            "Kolkata"           
## [229] "Kolkata"            "Siliguri"
aq_delhi_stations<-stations %>%filter(City=="Delhi")
aq_delhi_stations
aq_mumbai_stations<-stations %>%filter(City=="Mumbai")
aq_mumbai_stations
aq_chennai_stations<-stations %>%filter(City=="Chennai")
aq_chennai_stations
aq_bengaluru_stations<-stations %>%filter(City=="Bengaluru")
aq_bengaluru_stations
head(station_hour)
summary(station_hour)
##   StationId           Datetime             PM2.5             PM10        
##  Length:2589083     Length:2589083     Min.   :   0.0   Min.   :   0.0   
##  Class :character   Class :character   1st Qu.:  28.2   1st Qu.:  64.0   
##  Mode  :character   Mode  :character   Median :  52.6   Median : 116.2   
##                                        Mean   :  80.9   Mean   : 158.5   
##                                        3rd Qu.:  97.7   3rd Qu.: 204.0   
##                                        Max.   :1000.0   Max.   :1000.0   
##                                        NA's   :647689   NA's   :1119252  
##        NO              NO2              NOx              NH3         
##  Min.   :  0.0    Min.   :  0.0    Min.   :  0.0    Min.   :  0.0    
##  1st Qu.:  3.0    1st Qu.: 13.1    1st Qu.: 11.3    1st Qu.: 11.2    
##  Median :  7.2    Median : 24.8    Median : 22.9    Median : 22.4    
##  Mean   : 22.8    Mean   : 35.2    Mean   : 40.6    Mean   : 28.7    
##  3rd Qu.: 18.6    3rd Qu.: 45.5    3rd Qu.: 45.7    3rd Qu.: 37.8    
##  Max.   :500.0    Max.   :500.0    Max.   :500.0    Max.   :500.0    
##  NA's   :553711   NA's   :528973   NA's   :490808   NA's   :1236618  
##        CO              SO2               O3            Benzene      
##  Min.   :  0.0    Min.   :  0.0    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.:  0.4    1st Qu.:  4.2    1st Qu.: 11.0    1st Qu.:  0.1   
##  Median :  0.8    Median :  8.2    Median : 24.8    Median :  1.0   
##  Mean   :  1.5    Mean   : 12.1    Mean   : 38.1    Mean   :  3.3   
##  3rd Qu.:  1.4    3rd Qu.: 14.5    3rd Qu.: 49.5    3rd Qu.:  3.2   
##  Max.   :498.6    Max.   :200.0    Max.   :997.0    Max.   :498.1   
##  NA's   :499302   NA's   :742737   NA's   :725973   NA's   :861579  
##     Toluene            Xylene             AQI          AQI_Bucket       
##  Min.   :  0.0     Min.   :  0.0     Min.   :   5.0   Length:2589083    
##  1st Qu.:  0.3     1st Qu.:  0.0     1st Qu.:  84.0   Class :character  
##  Median :  3.4     Median :  0.2     Median : 131.0   Mode  :character  
##  Mean   : 14.9     Mean   :  2.4     Mean   : 180.2                     
##  3rd Qu.: 15.1     3rd Qu.:  1.8     3rd Qu.: 259.0                     
##  Max.   :500.0     Max.   :500.0     Max.   :3133.0                     
##  NA's   :1042366   NA's   :2075104   NA's   :570190
maxna <-max(rowSums(is.na(station_hour)))
cd_rowswithmaxna<-subset(station_hour, rowSums(is.na(station_hour)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
station_hour<-subset(station_hour, rowSums(is.na(station_hour)) != maxna)
#station_hour
head(station_hour)
station_hour <- replace(station_hour, station_hour=="", NA)%>% drop_na(AQI_Bucket)
station_hour$AQI_Bucket <- as.factor(station_hour$AQI_Bucket)
#str(station_hour)
summary(station_hour)
##   StationId           Datetime             PM2.5              PM10       
##  Length:2018893     Length:2018893     Min.   :   0.01   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:  28.25   1st Qu.:  64.0  
##  Mode  :character   Mode  :character   Median :  52.72   Median : 116.5  
##                                        Mean   :  80.76   Mean   : 158.8  
##                                        3rd Qu.:  97.64   3rd Qu.: 204.5  
##                                        Max.   :1000.00   Max.   :1000.0  
##                                        NA's   :138787    NA's   :597086  
##        NO              NO2              NOx              NH3        
##  Min.   :  0.01   Min.   :  0.01   Min.   :  0.00   Min.   :  0.0   
##  1st Qu.:  3.00   1st Qu.: 13.15   1st Qu.: 12.66   1st Qu.: 11.4   
##  Median :  7.05   Median : 24.80   Median : 24.13   Median : 22.5   
##  Mean   : 22.89   Mean   : 35.13   Mean   : 42.52   Mean   : 28.6   
##  3rd Qu.: 18.20   3rd Qu.: 45.41   3rd Qu.: 47.33   3rd Qu.: 37.8   
##  Max.   :500.00   Max.   :499.99   Max.   :500.00   Max.   :500.0   
##  NA's   :124498   NA's   :99059    NA's   :157440   NA's   :734261  
##        CO              SO2               O3            Benzene      
##  Min.   :  0.00   Min.   :  0.01   Min.   :  0.01   Min.   :  0.0   
##  1st Qu.:  0.48   1st Qu.:  4.30   1st Qu.: 10.92   1st Qu.:  0.1   
##  Median :  0.84   Median :  8.35   Median : 24.73   Median :  1.1   
##  Mean   :  1.53   Mean   : 12.15   Mean   : 38.27   Mean   :  3.5   
##  3rd Qu.:  1.42   3rd Qu.: 14.62   3rd Qu.: 49.95   3rd Qu.:  3.4   
##  Max.   :498.57   Max.   :199.96   Max.   :997.00   Max.   :498.1   
##  NA's   :170834   NA's   :294848   NA's   :274286   NA's   :510708  
##     Toluene           Xylene             AQI                AQI_Bucket    
##  Min.   :  0.0    Min.   :  0.0     Min.   :   5.0   Good        :152113  
##  1st Qu.:  0.8    1st Qu.:  0.0     1st Qu.:  84.0   Moderate    :675008  
##  Median :  4.2    Median :  0.4     Median : 131.0   Poor        :239990  
##  Mean   : 16.1    Mean   :  2.8     Mean   : 180.2   Satisfactory:530164  
##  3rd Qu.: 16.8    3rd Qu.:  2.1     3rd Qu.: 259.0   Severe      :120468  
##  Max.   :500.0    Max.   :500.0     Max.   :3133.0   Very Poor   :301150  
##  NA's   :670687   NA's   :1585481
head(station_hour)
tail(station_hour)

Extracting the lat, lon and elevation info from geo data. Banglore info was incorrect and it is correct with right values for lat, lon and elevation.

blr_geo<-geo_data %>%filter(Location_Name=="Bangalore")
blr_geo<-blr_geo%>%mutate(longitude=77.59, Latitude=12.97, Elevation=920)
chn_geo<-geo_data%>%filter(Location_Name=="Chennai")
dlhi_geo<-geo_data%>%filter(Location_Name=="Delhi")
mum_geo<-geo_data%>%filter(Location_Name=="Mumbai")
bhu_geo<-geo_data%>%filter(Location_Name=="Bhuvaneswar")
rourkela_geo<-geo_data%>%filter(Location_Name=="Rourkela")
rajas_geo<-geo_data%>%filter(Location_Name=="Rajastan")
lucknw_geo<-geo_data%>%filter(Location_Name=="Lucknow")
nrow(station_day)
## [1] 108035
ncol(station_day)
## [1] 16
head(station_day)
summary(station_day)
##   StationId             Date               PM2.5              PM10        
##  Length:108035      Length:108035      Min.   :   0.02   Min.   :   0.01  
##  Class :character   Class :character   1st Qu.:  31.88   1st Qu.:  70.15  
##  Mode  :character   Mode  :character   Median :  55.95   Median : 122.09  
##                                        Mean   :  80.27   Mean   : 157.97  
##                                        3rd Qu.:  99.92   3rd Qu.: 208.67  
##                                        Max.   :1000.00   Max.   :1000.00  
##                                        NA's   :21625     NA's   :42706    
##        NO              NO2              NOx              NH3        
##  Min.   :  0.01   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  4.84   1st Qu.: 15.09   1st Qu.: 13.97   1st Qu.: 11.90  
##  Median : 10.29   Median : 27.21   Median : 26.66   Median : 23.59  
##  Mean   : 23.12   Mean   : 35.24   Mean   : 41.20   Mean   : 28.73  
##  3rd Qu.: 24.98   3rd Qu.: 46.93   3rd Qu.: 50.50   3rd Qu.: 38.14  
##  Max.   :470.00   Max.   :448.05   Max.   :467.63   Max.   :418.90  
##  NA's   :17106    NA's   :16547    NA's   :15500    NA's   :48105   
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.530   1st Qu.:  5.04   1st Qu.: 18.89   1st Qu.:  0.160  
##  Median :  0.910   Median :  8.95   Median : 30.84   Median :  1.210  
##  Mean   :  1.606   Mean   : 12.26   Mean   : 38.13   Mean   :  3.358  
##  3rd Qu.:  1.450   3rd Qu.: 14.92   3rd Qu.: 47.14   3rd Qu.:  3.610  
##  Max.   :175.810   Max.   :195.65   Max.   :963.00   Max.   :455.030  
##  NA's   :12998     NA's   :25204    NA's   :25568    NA's   :31455    
##     Toluene           Xylene            AQI          AQI_Bucket       
##  Min.   :  0.00   Min.   :  0.00   Min.   :   8.0   Length:108035     
##  1st Qu.:  0.69   1st Qu.:  0.00   1st Qu.:  86.0   Class :character  
##  Median :  4.33   Median :  0.40   Median : 132.0   Mode  :character  
##  Mean   : 15.35   Mean   :  2.42   Mean   : 179.7                     
##  3rd Qu.: 17.51   3rd Qu.:  2.11   3rd Qu.: 254.0                     
##  Max.   :454.85   Max.   :170.37   Max.   :2049.0                     
##  NA's   :38702    NA's   :85137    NA's   :21010
maxna <-max(rowSums(is.na(station_day)))
cd_rowswithmaxna<-subset(station_day, rowSums(is.na(station_day)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
station_day<-subset(station_day, rowSums(is.na(station_day)) != maxna)
#station_day
head(station_day)
station_day <- replace(station_day, station_day=="", NA)%>% drop_na(AQI_Bucket)
station_day$AQI_Bucket <- as.factor(station_day$AQI_Bucket)
#str(station_day)
summary(station_day)
##   StationId             Date               PM2.5              PM10       
##  Length:87025       Length:87025       Min.   :   0.04   Min.   :  0.03  
##  Class :character   Class :character   1st Qu.:  32.02   1st Qu.: 70.52  
##  Mode  :character   Mode  :character   Median :  56.17   Median :122.43  
##                                        Mean   :  80.39   Mean   :158.56  
##                                        3rd Qu.: 100.17   3rd Qu.:209.47  
##                                        Max.   :1000.00   Max.   :976.77  
##                                        NA's   :3488      NA's   :23961   
##        NO              NO2              NOx              NH3        
##  Min.   :  0.02   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  4.79   1st Qu.: 15.18   1st Qu.: 15.51   1st Qu.: 12.05  
##  Median : 10.22   Median : 27.24   Median : 28.18   Median : 23.76  
##  Mean   : 23.24   Mean   : 35.12   Mean   : 43.25   Mean   : 28.65  
##  3rd Qu.: 24.92   3rd Qu.: 46.83   3rd Qu.: 52.59   3rd Qu.: 38.15  
##  Max.   :437.85   Max.   :448.05   Max.   :434.90   Max.   :365.68  
##  NA's   :2229     NA's   :1566     NA's   :4555     NA's   :29832   
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.600   1st Qu.:  5.09   1st Qu.: 18.92   1st Qu.:  0.260  
##  Median :  0.950   Median :  9.06   Median : 30.89   Median :  1.380  
##  Mean   :  1.616   Mean   : 12.21   Mean   : 38.32   Mean   :  3.567  
##  3rd Qu.:  1.470   3rd Qu.: 14.98   3rd Qu.: 47.45   3rd Qu.:  3.800  
##  Max.   :175.810   Max.   :186.08   Max.   :963.00   Max.   :455.030  
##  NA's   :2896      NA's   :9533     NA's   :9598     NA's   :19787    
##     Toluene           Xylene            AQI                AQI_Bucket   
##  Min.   :  0.00   Min.   :  0.00   Min.   :   8.0   Good        : 5510  
##  1st Qu.:  1.10   1st Qu.:  0.05   1st Qu.:  86.0   Moderate    :29417  
##  Median :  5.31   Median :  0.60   Median : 132.0   Poor        :11493  
##  Mean   : 16.56   Mean   :  2.77   Mean   : 179.7   Satisfactory:23636  
##  3rd Qu.: 19.32   3rd Qu.:  2.66   3rd Qu.: 254.0   Severe      : 5207  
##  Max.   :454.85   Max.   :170.37   Max.   :2049.0   Very Poor   :11762  
##  NA's   :26324    NA's   :67584
head(station_day)
tail(station_day)
findCity<-function(x)
{
  ifelse(x %in% aq_delhi_stations$StationId,"Delhi",
    ifelse(x %in% aq_mumbai_stations$StationId,"Mumbai",
     ifelse(x %in% aq_chennai_stations$StationId,"Chennai",
      ifelse(x %in% aq_bengaluru_stations$StationId,"Bengaluru","Other"))))
}

station_day_city<-station_day%>%mutate(City=findCity(StationId))
#station_day_city

major_city_station_day<-station_day_city%>%filter(City %in% c("Chennai",  "Bengaluru", "Delhi","Mumbai"))
#View(major_city_station_day)

st_na_total_per <-data.frame(colname = names(major_city_station_day),Total_NAs = colSums(is.na(major_city_station_day)), "PerctOfNAs" = colSums((is.na(major_city_station_day)/sum(is.na(major_city_station_day)) * 100)))
#st_na_total_per
ggplot(st_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col()

#major_city_station_day<-major_city_station_day[complete.cases(major_city_station_day),]
#major_city_station_day

major_city_station_day<-major_city_station_day%>%mutate(Latitude = case_when(City == "Chennai" ~ chn_geo$Latitude, City == "Bengaluru" ~ blr_geo$Latitude, City == "Mumbai" ~ mum_geo$Latitude,City == "Delhi" ~ dlhi_geo$Latitude),Longitude = case_when(City == "Chennai" ~ chn_geo$longitude, City == "Bengaluru" ~ blr_geo$longitude, City == "Mumbai" ~ mum_geo$longitude,City == "Delhi" ~ dlhi_geo$longitude),Elevation = case_when(City == "Chennai" ~ chn_geo$Elevation, City == "Bengaluru" ~ blr_geo$Elevation, City == "Mumbai" ~ mum_geo$Elevation,City == "Delhi" ~ dlhi_geo$Elevation))

#major_city_station_day

station_day data set is cleaned for missing values. Compared to City_day dataset station dataset is better and has less NA values for major pollutants. Unwanted pollutants can be removed from combined data set.

###Performing benfordness check for station_day data set

major_city_station_day_first_digit <-major_city_station_day %>% mutate(across(seq(3:17), firstDigit))#%>%subset(select=-c(1,2,12))
major_city_station_day_first_digit
PM2.5FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$PM2.5))
PM2.5first_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = PM2.5FD_Counts_stday,    
actual.fraction  = PM2.5FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
PM2.5first_digit_actual_vs_expected_stday<-PM2.5first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
PM2.5first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(PM2.5first_digit_actual_vs_expected_stday$difference.fraction), "PM2.5")
## [1] "PM2.5 Dataset has Acceptable confirmity with Mean abosolute Deviation 0.0083"
AQIFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$AQI))
AQIfirst_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = AQIFD_Counts_stday,    
actual.fraction  = AQIFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
AQIfirst_digit_actual_vs_expected_stday<-AQIfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
AQIfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(AQIfirst_digit_actual_vs_expected_stday$difference.fraction), "AQI")
## [1] "AQI Dataset has no confirmity with Mean abosolute Deviation 0.019"
NOFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NO))
NOfirst_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = NOFD_Counts_stday,    
actual.fraction  = NOFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOfirst_digit_actual_vs_expected_stday<-NOfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NOfirst_digit_actual_vs_expected_stday$difference.fraction), "NO")
## [1] "NO Dataset has Close confirmity with Mean abosolute Deviation 0.0058"
NO2FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NO2))
NO2first_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = NO2FD_Counts_stday,    
actual.fraction  = NO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NO2first_digit_actual_vs_expected_stday<-NO2first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NO2first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NO2first_digit_actual_vs_expected_stday$difference.fraction), "NO2")
## [1] "NO2 Dataset has no confirmity with Mean abosolute Deviation 0.016"
COFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$CO))
COfirst_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = COFD_Counts_stday,    
actual.fraction  = COFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
COfirst_digit_actual_vs_expected_stday<-COfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
COfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(COfirst_digit_actual_vs_expected_stday$difference.fraction), "CO")
## [1] "CO Dataset has no confirmity with Mean abosolute Deviation 0.035"
SO2FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$SO2))
SO2first_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = SO2FD_Counts_stday,    
actual.fraction  = SO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
SO2first_digit_actual_vs_expected_stday<-NO2first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
SO2first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(SO2first_digit_actual_vs_expected_stday$difference.fraction), "SO2")
## [1] "SO2 Dataset has no confirmity with Mean abosolute Deviation 0.016"
NOXFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NOX))
NOXfirst_digit_actual_vs_expected_stday <- data.frame(
digit            = 1:9,
actual.count     = SO2FD_Counts_stday,    
actual.fraction  = SO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOXfirst_digit_actual_vs_expected_stday<-NOXfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOXfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NOXfirst_digit_actual_vs_expected_stday$difference.fraction), "NOX")
## [1] "NOX Dataset has no confirmity with Mean abosolute Deviation 0.017"

“PM2.5 Dataset has Acceptable confirmity with Mean abosolute Deviation 0.0083” has accepatable confirmity and “NO Dataset has Close confirmity with Mean abosolute Deviation 0.0058” Rest has no conformity with minor variation. [1] “AQI Dataset has no confirmity with Mean abosolute Deviation 0.019” [1] “NO2 Dataset has no confirmity with Mean abosolute Deviation 0.016” [1] “CO Dataset has no confirmity with Mean abosolute Deviation 0.035” [1] “SO2 Dataset has no confirmity with Mean abosolute Deviation 0.016” [1] “NOX Dataset has no confirmity with Mean abosolute Deviation 0.017”

head(blr_wthr)
summary(blr_wthr)
##      time                tavg            tmin            tmax      
##  Length:11894       Min.   :17.20   Min.   : 9.30   Min.   :19.80  
##  Class :character   1st Qu.:22.30   1st Qu.:18.10   1st Qu.:27.90  
##  Mode  :character   Median :23.50   Median :19.80   Median :29.50  
##                     Mean   :23.84   Mean   :19.39   Mean   :29.93  
##                     3rd Qu.:25.20   3rd Qu.:20.80   3rd Qu.:32.00  
##                     Max.   :32.40   Max.   :27.90   Max.   :39.20  
##                     NA's   :70      NA's   :1389    NA's   :629    
##       prcp        
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  4.414  
##  3rd Qu.:  2.000  
##  Max.   :271.300  
##  NA's   :4620
head(chn_wthr)
summary(chn_wthr)
##      time                tavg            tmin            tmax      
##  Length:11894       Min.   :20.90   Min.   :12.00   Min.   :23.80  
##  Class :character   1st Qu.:26.30   1st Qu.:22.60   1st Qu.:31.10  
##  Mode  :character   Median :28.70   Median :24.60   Median :34.00  
##                     Mean   :28.49   Mean   :24.38   Mean   :33.91  
##                     3rd Qu.:30.40   3rd Qu.:26.40   3rd Qu.:36.20  
##                     Max.   :36.60   Max.   :31.00   Max.   :44.60  
##                     NA's   :27      NA's   :3084    NA's   :1019   
##       prcp        
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  6.244  
##  3rd Qu.:  3.000  
##  Max.   :344.900  
##  NA's   :4886
head(dlhi_wthr)
summary(dlhi_wthr)
##      time                tavg           tmin            tmax      
##  Length:11894       Min.   : 6.6   Min.   : 0.10   Min.   : 9.80  
##  Class :character   1st Qu.:18.5   1st Qu.:11.80   1st Qu.:26.70  
##  Mode  :character   Median :27.0   Median :20.00   Median :33.20  
##                     Mean   :25.0   Mean   :18.88   Mean   :31.79  
##                     3rd Qu.:30.9   3rd Qu.:26.00   3rd Qu.:36.60  
##                     Max.   :39.8   Max.   :34.20   Max.   :48.10  
##                     NA's   :94     NA's   :1536    NA's   :533    
##       prcp        
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  3.662  
##  3rd Qu.:  0.500  
##  Max.   :262.900  
##  NA's   :6140
head(lucknw_wthr)
summary(lucknw_wthr)
##      time                tavg            tmin           tmax      
##  Length:11894       Min.   : 5.70   Min.   :-0.6   Min.   :11.10  
##  Class :character   1st Qu.:19.50   1st Qu.:12.5   1st Qu.:28.10  
##  Mode  :character   Median :27.20   Median :20.5   Median :33.40  
##                     Mean   :25.22   Mean   :18.8   Mean   :32.49  
##                     3rd Qu.:30.40   3rd Qu.:25.1   3rd Qu.:36.50  
##                     Max.   :39.70   Max.   :32.7   Max.   :47.30  
##                     NA's   :138     NA's   :3515   NA's   :1553   
##       prcp        
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  4.536  
##  3rd Qu.:  1.000  
##  Max.   :470.900  
##  NA's   :6152
head(mum_wthr)
summary(mum_wthr)
##      time                tavg            tmin            tmax      
##  Length:11894       Min.   :17.70   Min.   : 8.50   Min.   :22.30  
##  Class :character   1st Qu.:26.60   1st Qu.:19.80   1st Qu.:30.90  
##  Mode  :character   Median :28.10   Median :23.70   Median :32.40  
##                     Mean   :27.76   Mean   :22.62   Mean   :32.31  
##                     3rd Qu.:29.30   3rd Qu.:25.40   3rd Qu.:33.90  
##                     Max.   :33.70   Max.   :30.40   Max.   :41.30  
##                     NA's   :11      NA's   :2454    NA's   :1907   
##       prcp       
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  0.00  
##  Mean   : 10.94  
##  3rd Qu.:  7.10  
##  Max.   :461.00  
##  NA's   :4681
head(geo_data)
head(bhu_wthr)
summary(bhu_wthr)
##      time                tavg            tmin            tmax     
##  Length:11935       Min.   :15.70   Min.   : 8.20   Min.   :19.4  
##  Class :character   1st Qu.:24.70   1st Qu.:19.00   1st Qu.:30.4  
##  Mode  :character   Median :27.70   Median :24.00   Median :32.8  
##                     Mean   :26.99   Mean   :22.24   Mean   :33.0  
##                     3rd Qu.:29.40   3rd Qu.:25.60   3rd Qu.:35.4  
##                     Max.   :37.40   Max.   :31.80   Max.   :46.7  
##                     NA's   :78      NA's   :2090    NA's   :891   
##       prcp           snow              wdir            wspd       
##  Min.   :  0.000   Mode:logical   Min.   :  0.0   Min.   : 0.500  
##  1st Qu.:  0.000   NA's:11935     1st Qu.: 89.0   1st Qu.: 4.500  
##  Median :  0.000                  Median :188.0   Median : 7.000  
##  Mean   :  7.074                  Mean   :169.1   Mean   : 8.399  
##  3rd Qu.:  4.100                  3rd Qu.:220.8   3rd Qu.:11.000  
##  Max.   :470.900                  Max.   :359.0   Max.   :33.100  
##  NA's   :5097                     NA's   :10641   NA's   :9806    
##    wpgt              pres          tsun        
##  Mode:logical   Min.   : 990.6   Mode:logical  
##  NA's:11935     1st Qu.:1002.9   NA's:11935    
##                 Median :1007.3                 
##                 Mean   :1007.4                 
##                 3rd Qu.:1012.4                 
##                 Max.   :1019.3                 
##                 NA's   :10692
head(rourkela_wthr)
summary(rourkela_wthr)
##      time                tavg            tmin            tmax      
##  Length:426         Min.   :14.60   Min.   : 8.20   Min.   :21.50  
##  Class :character   1st Qu.:24.40   1st Qu.:18.18   1st Qu.:29.60  
##  Mode  :character   Median :28.10   Median :25.20   Median :32.10  
##                     Mean   :26.71   Mean   :22.30   Mean   :32.25  
##                     3rd Qu.:29.30   3rd Qu.:26.10   3rd Qu.:33.80  
##                     Max.   :35.00   Max.   :29.30   Max.   :43.60  
##                     NA's   :2       NA's   :2       NA's   :2      
##       prcp           snow              wdir            wspd       
##  Min.   :  0.000   Mode:logical   Min.   :  0.0   Min.   : 2.900  
##  1st Qu.:  0.000   NA's:426       1st Qu.: 49.0   1st Qu.: 5.500  
##  Median :  0.200                  Median :168.0   Median : 6.600  
##  Mean   :  5.695                  Mean   :140.3   Mean   : 7.441  
##  3rd Qu.:  7.200                  3rd Qu.:195.2   3rd Qu.: 8.725  
##  Max.   :123.000                  Max.   :359.0   Max.   :20.400  
##  NA's   :3                        NA's   :2       NA's   :2       
##    wpgt              pres          tsun        
##  Mode:logical   Min.   : 993.1   Mode:logical  
##  NA's:426       1st Qu.:1002.5   NA's:426      
##                 Median :1005.5                 
##                 Mean   :1006.8                 
##                 3rd Qu.:1012.1                 
##                 Max.   :1020.6                 
##                 NA's   :2
head(rajas_wthr)
summary(rajas_wthr)
##      time                tavg            tmin            tmax      
##  Length:11894       Min.   :17.20   Min.   : 9.30   Min.   :19.80  
##  Class :character   1st Qu.:22.30   1st Qu.:18.10   1st Qu.:27.90  
##  Mode  :character   Median :23.50   Median :19.80   Median :29.50  
##                     Mean   :23.84   Mean   :19.39   Mean   :29.93  
##                     3rd Qu.:25.20   3rd Qu.:20.80   3rd Qu.:32.00  
##                     Max.   :32.40   Max.   :27.90   Max.   :39.20  
##                     NA's   :70      NA's   :1389    NA's   :629    
##       prcp        
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  4.414  
##  3rd Qu.:  2.000  
##  Max.   :271.300  
##  NA's   :4620

Detailed analysis

Analysis temperature and precipitation over different Seasons for major Cities over different years

bhu_wthr_clnd<-bhu_wthr[complete.cases(bhu_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Bhuvaneswar", Latitude=bhu_geo$Latitude, Longitude=bhu_geo$longitude, Elevation=bhu_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(bhu_wthr_clnd)


rourkela_wthr_clnd<-rourkela_wthr[complete.cases(rourkela_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Rourkela", Latitude=rourkela_geo$Latitude, Longitude=rourkela_geo$longitude, Elevation=rourkela_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(rourkela_wthr_clnd)

rajas_wthr_clnd<-rajas_wthr[complete.cases(rajas_wthr),]%>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Rajasthan", Latitude=rajas_geo$Latitude, Longitude=rajas_geo$longitude, Elevation=rajas_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(rajas_wthr_clnd)
rajas_wthr_clnd_seasn<-rajas_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City, Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Year'. You can override using the
## `.groups` argument.
#rajas_wthr_clnd_seasn

lucknw_wthr_clnd<-lucknw_wthr[complete.cases(lucknw_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Lucknow", Latitude=lucknw_geo$Latitude, Longitude=lucknw_geo$longitude, Elevation=lucknw_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(lucknw_wthr_clnd)
lucknw_wthr_clnd_seasn<-lucknw_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City, Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Year'. You can override using the
## `.groups` argument.
#lucknw_wthr_clnd_seasn


blr_wthr_clnd<-blr_wthr[complete.cases(blr_wthr),] %>% filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Bengaluru", Latitude=blr_geo$Latitude, Longitude=blr_geo$longitude, Elevation=blr_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(blr_wthr_clnd)
blr_wthr_clnd_seasn<-blr_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#blr_wthr_clnd_seasn
names(blr_wthr_clnd_seasn)
##  [1] "City"      "Latitude"  "Longitude" "Elevation" "Year"      "Season"   
##  [7] "tavg"      "tmin"      "tmax"      "prcp"
ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tmax, color=blr_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tavg, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tmin, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$prcp, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

chn_wthr_clnd<-chn_wthr[complete.cases(chn_wthr),] %>% filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Chennai", Latitude=chn_geo$Latitude, Longitude=chn_geo$longitude, Elevation=chn_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(chn_wthr_clnd)
chn_wthr_clnd_seasn<-chn_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#chn_wthr_clnd_seasn

ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tmax, color=chn_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tavg, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tmin, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$prcp, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

dlhi_wthr_clnd<-dlhi_wthr[complete.cases(dlhi_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Delhi", Latitude=dlhi_geo$Latitude, Longitude=dlhi_geo$longitude, Elevation=dlhi_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(dlhi_wthr_clnd)
dlhi_wthr_clnd_seasn<-dlhi_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#dlhi_wthr_clnd_seasn

ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tmax, color=dlhi_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tavg, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tmin, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$prcp, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

mum_wthr_clnd<-mum_wthr[complete.cases(mum_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Mumbai", Latitude=mum_geo$Latitude, Longitude=mum_geo$longitude, Elevation=mum_geo$Elevation, Season=season(ymd(as.Date(time))))
mum_wthr_clnd_seasn<-mum_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#mum_wthr_clnd_seasn


ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tmax, color=mum_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tavg, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tmin, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$prcp, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

Pattern of tmin, tmax, tavg, and prcp has changed over years.

Chennai - tavg can be higher during summer or monsoon, precipitation is higher summer and prewinter.

Bangalore -tavg mostly ver high during summer but lower than chennai, precipitation is high during autumn or summer

Mumbai-tavg mostly ver high during summer, precipitation is high during monsoon and autumn

Delhi- tavg can be higher during summer or monsoon, precipitation is high during summer, prewinter and monsoon. It has changed over years.

major_city_station_day_seasonwise<-major_city_station_day%>%mutate(Year=lubridate::year(as.Date(Date)), Season=season(as.Date(Date)))%>%group_by(City,Latitude,Longitude, Elevation,Year, Season)%>% summarise(across(c(2:16), mean, na.rm = TRUE))%>%select(-c(4,18))
## Warning: There were 232 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(2:16), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `Latitude = 12.97`, `Longitude = 77.59`,
##   `Elevation = 920`, `Year = 2015`, `Season = "Autumn"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 231 remaining warnings.
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
## Adding missing grouping variables: `Elevation`
#major_city_station_day_seasonwise

list_city_wthr<-list(blr_wthr_clnd_seasn, chn_wthr_clnd_seasn, dlhi_wthr_clnd_seasn, mum_wthr_clnd_seasn)
major_city_wthr<-Reduce(function(x,y) merge(x, y, all=TRUE), list_city_wthr)
#major_city_wthr
     
major_city_AQ_wthr_pos<-merge(major_city_station_day_seasonwise, major_city_wthr, all=TRUE)
summary(major_city_AQ_wthr_pos)
##    Elevation       City              Latitude       Longitude    
##  Min.   :  6   Length:134         Min.   :12.97   Min.   :72.85  
##  1st Qu.:  6   Class :character   1st Qu.:12.97   1st Qu.:77.20  
##  Median :211   Mode  :character   Median :13.07   Median :77.59  
##  Mean   :307                      Mean   :18.38   Mean   :77.28  
##  3rd Qu.:920                      3rd Qu.:28.58   3rd Qu.:80.25  
##  Max.   :920                      Max.   :28.58   Max.   :80.25  
##                                                                  
##       Year         Season               Date         PM2.5       
##  Min.   :2015   Length:134         Min.   : NA   Min.   : 10.44  
##  1st Qu.:2016   Class :character   1st Qu.: NA   1st Qu.: 39.76  
##  Median :2018   Mode  :character   Median : NA   Median : 55.77  
##  Mean   :2018                      Mean   :NaN   Mean   : 74.47  
##  3rd Qu.:2019                      3rd Qu.: NA   3rd Qu.: 82.84  
##  Max.   :2021                      Max.   : NA   Max.   :263.47  
##                                    NA's   :134   NA's   :18      
##       PM10              NO               NO2               NOx        
##  Min.   : 34.84   Min.   :  3.580   Min.   :  8.771   Min.   :  8.84  
##  1st Qu.: 77.48   1st Qu.:  8.567   1st Qu.: 17.060   1st Qu.: 19.40  
##  Median :122.21   Median : 13.289   Median : 30.183   Median : 32.14  
##  Mean   :159.14   Mean   : 23.211   Mean   : 33.550   Mean   : 44.83  
##  3rd Qu.:226.52   3rd Qu.: 29.830   3rd Qu.: 47.472   3rd Qu.: 66.37  
##  Max.   :490.09   Max.   :111.990   Max.   :100.848   Max.   :163.85  
##  NA's   :45       NA's   :18        NA's   :18        NA's   :21      
##       NH3                CO              SO2               O3         
##  Min.   :  6.772   Min.   :0.3914   Min.   : 2.861   Min.   :  8.347  
##  1st Qu.: 20.937   1st Qu.:0.9369   1st Qu.: 5.887   1st Qu.: 28.562  
##  Median : 34.818   Median :1.2465   Median : 9.750   Median : 36.013  
##  Mean   : 45.507   Mean   :1.7999   Mean   :10.867   Mean   : 40.403  
##  3rd Qu.: 53.762   3rd Qu.:1.8860   3rd Qu.:14.126   3rd Qu.: 45.694  
##  Max.   :225.576   Max.   :9.9567   Max.   :36.554   Max.   :111.430  
##  NA's   :29        NA's   :18       NA's   :18       NA's   :18       
##     Benzene             Xylene            AQI           AQI_Bucket 
##  Min.   : 0.02893   Min.   :0.0000   Min.   : 55.14   Min.   : NA  
##  1st Qu.: 0.71293   1st Qu.:0.4828   1st Qu.: 97.99   1st Qu.: NA  
##  Median : 1.71125   Median :1.1136   Median :124.06   Median : NA  
##  Mean   : 3.01885   Mean   :1.6075   Mean   :164.64   Mean   :NaN  
##  3rd Qu.: 4.22737   3rd Qu.:2.3377   3rd Qu.:193.03   3rd Qu.: NA  
##  Max.   :21.55448   Max.   :5.3271   Max.   :457.61   Max.   : NA  
##  NA's   :19         NA's   :107      NA's   :18       NA's   :134  
##       tavg            tmin             tmax            prcp       
##  Min.   :12.67   Min.   : 8.533   Min.   :19.09   Min.   : 0.000  
##  1st Qu.:22.56   1st Qu.:18.683   1st Qu.:28.39   1st Qu.: 1.060  
##  Median :26.37   Median :21.792   Median :30.52   Median : 4.703  
##  Mean   :25.25   Mean   :20.916   Mean   :30.46   Mean   : 6.983  
##  3rd Qu.:28.65   3rd Qu.:24.303   3rd Qu.:34.02   3rd Qu.: 9.111  
##  Max.   :31.90   Max.   :28.950   Max.   :39.00   Max.   :49.060  
##  NA's   :18      NA's   :18       NA's   :18      NA's   :18
#major_city_AQ_wthr_pos
major_city_AQ_wthr_pos_clnd<-major_city_AQ_wthr_pos[,-c(7,9,13,17,18,20)]
#major_city_AQ_wthr_pos_clnd

City_AQ_WTHR_POS<-major_city_AQ_wthr_pos_clnd
City_AQ_WTHR_POS<-City_AQ_WTHR_POS[!is.na(City_AQ_WTHR_POS$PM2.5),]
#City_AQ_WTHR_POS
ggplot(City_AQ_WTHR_POS, aes(x=Year, y=PM2.5, color=Season)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
             outlier.size=2, notch=FALSE) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))

Checked if there PM2.5 outliers in the dataset. There are no outliers.

Plot of major cities in the map with PM2.5 and AQI

cityMap <- City_AQ_WTHR_POS %>% 
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lat = City_AQ_WTHR_POS$Latitude, lng = City_AQ_WTHR_POS$Longitude, weight = 5, color = 'midnightblue', radius = sqrt(City_AQ_WTHR_POS$AQI)) %>%
  addPopups(lat = City_AQ_WTHR_POS$Latitude, lng = City_AQ_WTHR_POS$Longitude, popup = paste("PM2.5 ", as.character(round(City_AQ_WTHR_POS$PM2.5, 2)),"\n", "AQI ", as.character(round(City_AQ_WTHR_POS$AQI, 2))))
cityMap

Impact of Season, temperature average, precipitation, Elevation, NO, NO2, NOx

NO, NO2, NOx are considered for analysis as the source of these pollutants and PM2.5 are similar and these pollutants may serve as precursor for PM2.5. Just to limit the scope of analysis only these pollutants and AQI are considered for modeling. To start with their impacts on PM2.5 are studied.

ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NO)) +
  geom_point(na.rm=TRUE) +
  labs(title = "Impact of AQI, Season and NO on PM2.5")

Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NO. Monsoon are concentration of PM2.5 and NO are lower with low AQI.

ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NO2)) +
  geom_point(na.rm=TRUE) +
  labs(title = "Impact of AQI, Season and NO2 on PM2.5")

Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NO2. Monsoon are concentration of PM2.5 and NO2 are lower with low AQI.Summer and spring are moderate

ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NOx)) +
  geom_point(na.rm=TRUE) +
  labs(title = "Impact of AQI, Season and NOx on PM2.5")

Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NOx. Monsoon are concentration of PM2.5 and NOx are lower with low AQI.Summer and spring are moderate

ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = Elevation)) +
  geom_point(na.rm=TRUE) +
  labs(title = "Impact of AQI, Season and Elevation on PM2.5")

Higher the elevation PM2.5 is low with low AQI. PM2.5 is generally high during autumns, prewinter and winter. moderate during summer and spring. Monsoons low PM2.5.

ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = tavg , size = prcp, na.rm=TRUE)) +
  geom_point(na.rm=TRUE) +
  labs(title = "Impact of Average Temperature, Season and Precipitation on PM2.5")

Higher the temperature and high precipitation leads to low PM2.5.

#city_hour_sepdt

city_hour_test_data<-city_hour_sepdt%>%mutate(Latitude = case_when(City == "Chennai" ~ chn_geo$Latitude, City == "Bengaluru" ~ blr_geo$Latitude, City == "Mumbai" ~ mum_geo$Latitude,City == "Delhi" ~ dlhi_geo$Latitude),Longitude = case_when(City == "Chennai" ~ chn_geo$longitude, City == "Bengaluru" ~ blr_geo$longitude, City == "Mumbai" ~ mum_geo$longitude,City == "Delhi" ~ dlhi_geo$longitude),Elevation = case_when(City == "Chennai" ~ chn_geo$Elevation, City == "Bengaluru" ~ blr_geo$Elevation, City == "Mumbai" ~ mum_geo$Elevation,City == "Delhi" ~ dlhi_geo$Elevation))

major_city_hour_test_data<-city_hour_test_data %>% filter(City %in% c("Chennai",  "Bengaluru", "Delhi","Mumbai"))  
major_city_hour_test_data
major_city_hour_test_data1<-major_city_hour_test_data%>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(3:15), mean, na.rm = TRUE))%>%select(-c(12,16,17,18))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#major_city_hour_test_data1

major_city_hour_test_data_temp<-merge(major_city_hour_test_data1, major_city_wthr, all=TRUE)
#major_city_hour_test_data_temp

major_city_hour_test_data_final<-major_city_hour_test_data_temp%>%select(-c(2,3,8,12,13,14,17,18))
#major_city_hour_test_data_final

head(major_city_hour_test_data_final)

PM2.5 Prediction model

11 different with lm and rpart(22 models) separately trained with different explanatory variables. Combined data set of station_day, weather and geo data of Chennai, Bengaluru, Delhi and mumbai is used for training. Combined data with city_day, weather and geo is used as test dataset.

LM Model1: Response Variable: PM2.5 Explanatory Variable(s): NO LM Model2: Response Variable: PM2.5 Explanatory Variable(s): NO + Season LM Model3: Response Variable: PM2.5 Explanatory Variable(s): NO2 + Season LM Model4: Response Variable: PM2.5 Explanatory Variable(s): NOx + Season LM Model5: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season LM Model6: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg LM Model7: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp LM Model8: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + Elevation LM Model9: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation LM Model10: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI LM Model11: Response Variable: PM2.5 Explanatory Variable(s): AQI

RPART Model1: Response Variable: PM2.5 Explanatory Variable(s): NO RPART Model2: Response Variable: PM2.5 Explanatory Variable(s): NO + Season RPART Model3: Response Variable: PM2.5 Explanatory Variable(s): NO2 + Season RPART Model4: Response Variable: PM2.5 Explanatory Variable(s): NOx + Season RPART Model5: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season RPART Model6: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg RPART Model7: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp RPART Model8: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + Elevation RPART Model9: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation RPART Model10: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI RPART Model11: Response Variable: PM2.5 Explanatory Variable(s): AQI

lmmodel1_NO<-lm(PM2.5~NO, City_AQ_WTHR_POS)
fmodel(lmmodel1_NO)

rpartmode11_NO<-rpart(PM2.5~NO, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode11_NO)

prp(rpartmode11_NO)

lmmodel2_NOSea<-lm(PM2.5~NO+Season, City_AQ_WTHR_POS)
fmodel(lmmodel2_NOSea)

rpartmode12_NOSea<-rpart(PM2.5~NO+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode12_NOSea)

prp(rpartmode12_NOSea)

lmmodel3_NO2Sea<-lm(PM2.5~NO2+Season, City_AQ_WTHR_POS)
fmodel(lmmodel3_NO2Sea)

rpartmode13_NO2Sea<-rpart(PM2.5~NO2+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode13_NO2Sea)

prp(rpartmode13_NO2Sea)

lmmodel4_NOXSea<-lm(PM2.5~NOx+Season, City_AQ_WTHR_POS)
fmodel(lmmodel4_NOXSea)

rpartmode14_NOXSea<-rpart(PM2.5~NOx+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode14_NOXSea)

prp(rpartmode14_NOXSea)

lmmodel5_NOCombSea<-lm(PM2.5~NO+NO2+NOx+Season, City_AQ_WTHR_POS)
fmodel(lmmodel5_NOCombSea)

rpartmode15_NOCombSea<-rpart(PM2.5~NO+NO2+NOx+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode15_NOCombSea)

prp(rpartmode15_NOCombSea)

lmmodel6_NOTempCombSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg, City_AQ_WTHR_POS)
fmodel(lmmodel6_NOTempCombSea)

rpartmode16_NOTempCombSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode16_NOTempCombSea)

prp(rpartmode16_NOTempCombSea)

lmmodel7_NOTempCombPrcPSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp, City_AQ_WTHR_POS)
fmodel(lmmodel7_NOTempCombPrcPSea)

rpartmodel7_NOTempCombPrcPSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel7_NOTempCombPrcPSea)

prp(rpartmodel7_NOTempCombPrcPSea)

lmmodel8_NOCombElevPSea<-lm(PM2.5~NO+NO2+NOx+Season+Elevation, City_AQ_WTHR_POS)
fmodel(lmmodel8_NOCombElevPSea)

rpartmodel8_NOCombElevPSea<-rpart(PM2.5~NO+NO2+NOx+NOx+Season+Elevation, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel8_NOCombElevPSea)

prp(rpartmodel8_NOCombElevPSea)

lmmodel9_NOTempCombElevPrcpSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation, City_AQ_WTHR_POS)
fmodel(lmmodel9_NOTempCombElevPrcpSea)

rpartmodel9_NOTempCombElevPrcpSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel9_NOTempCombElevPrcpSea)

prp(rpartmodel9_NOTempCombElevPrcpSea)

lmmodel10_NOTempCombElevPrcpSeaAQI<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation+AQI, City_AQ_WTHR_POS)
fmodel(lmmodel10_NOTempCombElevPrcpSeaAQI)

rpartmodel10_NOTempCombElevPrcpSeaAQI<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation+AQI, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel10_NOTempCombElevPrcpSeaAQI)

prp(rpartmodel10_NOTempCombElevPrcpSeaAQI)

lmmodel11_AQI<-lm(PM2.5~AQI, City_AQ_WTHR_POS)
fmodel(lmmodel11_AQI)

rpartmodel11_AQI<-rpart(PM2.5~AQI, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel11_AQI)

prp(rpartmodel11_AQI)

lmmodel1_output <- data.frame(evaluate_model(lmmodel1_NO, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel1_output<-lmmodel1_output %>% mutate(diff =(lmmodel1_output$PM2.5 - lmmodel1_output$model_output))
lmmodel1_MSE<-mean((lmmodel1_output$diff)^2, na.rm=TRUE)

rpartmodel1_output <- evaluate_model(rpartmode11_NO, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel1_output<-rpartmodel1_output%>%mutate(diff =(rpartmodel1_output$PM2.5-rpartmodel1_output$model_output))
rpartmodel1_MSE<-mean((rpartmodel1_output$diff)^2, na.rm=TRUE)

lmmodel2_output <- data.frame(evaluate_model(lmmodel2_NOSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel2_output<-lmmodel2_output %>% mutate(diff =(lmmodel2_output$PM2.5 - lmmodel2_output$model_output))
lmmodel2_MSE<-mean((lmmodel2_output$diff)^2, na.rm=TRUE)

rpartmodel2_output <- evaluate_model(rpartmode12_NOSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel2_output<-rpartmodel2_output%>%mutate(diff =(rpartmodel2_output$PM2.5-rpartmodel2_output$model_output))
rpartmodel2_MSE<-mean((rpartmodel2_output$diff)^2, na.rm=TRUE)

lmmodel3_output <- data.frame(evaluate_model(lmmodel3_NO2Sea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel3_output<-lmmodel3_output %>% mutate(diff =(lmmodel3_output$PM2.5 - lmmodel3_output$model_output))
lmmodel3_MSE<-mean((lmmodel3_output$diff)^2, na.rm=TRUE)

rpartmodel3_output <- evaluate_model(rpartmode13_NO2Sea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel3_output<-rpartmodel3_output%>%mutate(diff =(rpartmodel3_output$PM2.5-rpartmodel3_output$model_output))
rpartmodel3_MSE<-mean((rpartmodel3_output$diff)^2, na.rm=TRUE)

lmmodel4_output <- data.frame(evaluate_model(lmmodel4_NOXSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel4_output<-lmmodel4_output %>% mutate(diff =(lmmodel4_output$PM2.5 - lmmodel4_output$model_output))
lmmodel4_MSE<-mean((lmmodel4_output$diff)^2, na.rm=TRUE)

rpartmodel4_output <- evaluate_model(rpartmode14_NOXSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel4_output<-rpartmodel4_output%>%mutate(diff =(rpartmodel4_output$PM2.5-rpartmodel4_output$model_output))
rpartmodel4_MSE<-mean((rpartmodel4_output$diff)^2, na.rm=TRUE)

lmmodel5_output <- data.frame(evaluate_model(lmmodel5_NOCombSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel5_output<-lmmodel5_output %>% mutate(diff =(lmmodel5_output$PM2.5 - lmmodel5_output$model_output))
lmmodel5_MSE<-mean((lmmodel5_output$diff)^2, na.rm=TRUE)


rpartmodel5_output <- evaluate_model(rpartmode15_NOCombSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel5_output<-rpartmodel5_output%>%mutate(diff =(rpartmodel5_output$PM2.5-rpartmodel5_output$model_output))
rpartmodel5_MSE<-mean((rpartmodel5_output$diff)^2, na.rm=TRUE)
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Season for training data set", rpartmodel5_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Season for training data set 710.426155551455"
lmmodel6_output <- data.frame(evaluate_model(lmmodel6_NOTempCombSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel6_output<-lmmodel6_output %>% mutate(diff =(lmmodel6_output$PM2.5 - lmmodel6_output$model_output))
lmmodel6_MSE<-mean((lmmodel6_output$diff)^2, na.rm=TRUE)

rpartmodel6_output <- evaluate_model(rpartmode16_NOTempCombSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel6_output<-rpartmodel6_output%>%mutate(diff =(rpartmodel6_output$PM2.5-rpartmodel6_output$model_output))
rpartmodel6_MSE<-mean((rpartmodel6_output$diff)^2, na.rm=TRUE)

lmmodel7_output <- data.frame(evaluate_model(lmmodel7_NOTempCombPrcPSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel7_output<-lmmodel7_output %>% mutate(diff =(lmmodel7_output$PM2.5 - lmmodel7_output$model_output))
lmmodel7_MSE<-mean((lmmodel7_output$diff)^2, na.rm=TRUE)

rpartmodel7_output <- evaluate_model(rpartmodel7_NOTempCombPrcPSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel7_output<-rpartmodel7_output%>%mutate(diff =(rpartmodel7_output$PM2.5-rpartmodel7_output$model_output))
rpartmodel7_MSE<-mean((rpartmodel7_output$diff)^2, na.rm=TRUE)

lmmodel8_output <- data.frame(evaluate_model(lmmodel8_NOCombElevPSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel8_output<-lmmodel8_output %>% mutate(diff =(lmmodel8_output$PM2.5 - lmmodel8_output$model_output))
lmmodel8_MSE<-mean((lmmodel8_output$diff)^2, na.rm=TRUE)

rpartmodel8_output <- evaluate_model(rpartmodel8_NOCombElevPSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel8_output<-rpartmodel8_output%>%mutate(diff =(rpartmodel8_output$PM2.5-rpartmodel8_output$model_output))
rpartmodel8_MSE<-mean((rpartmodel8_output$diff)^2, na.rm=TRUE)

lmmodel9_output <- data.frame(evaluate_model(lmmodel9_NOTempCombElevPrcpSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel9_output<-lmmodel9_output %>% mutate(diff =(lmmodel9_output$PM2.5 - lmmodel9_output$model_output))
lmmodel9_MSE<-mean((lmmodel9_output$diff)^2, na.rm=TRUE)

rpartmodel9_output <- evaluate_model(rpartmodel9_NOTempCombElevPrcpSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel9_output<-rpartmodel9_output%>%mutate(diff =(rpartmodel9_output$PM2.5-rpartmodel9_output$model_output))
rpartmodel9_MSE<-mean((rpartmodel9_output$diff)^2, na.rm=TRUE)

lmmodel10_output <- data.frame(evaluate_model(lmmodel10_NOTempCombElevPrcpSeaAQI, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel10_output<-lmmodel10_output %>% mutate(diff =(lmmodel10_output$PM2.5 - lmmodel10_output$model_output))
lmmodel10_MSE<-mean((lmmodel10_output$diff)^2, na.rm=TRUE)

rpartmodel10_output <- evaluate_model(rpartmodel10_NOTempCombElevPrcpSeaAQI, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel10_output<-rpartmodel10_output%>%mutate(diff =(rpartmodel10_output$PM2.5-rpartmodel10_output$model_output))
rpartmodel10_MSE<-mean((rpartmodel10_output$diff)^2, na.rm=TRUE)


lmmodel11_output <- data.frame(evaluate_model(lmmodel11_AQI, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel11_output<-lmmodel11_output %>% mutate(diff =(lmmodel11_output$PM2.5 - lmmodel11_output$model_output))
lmmodel11_MSE<-mean((lmmodel11_output$diff)^2, na.rm=TRUE)

rpartmodel11_output <- evaluate_model(rpartmodel11_AQI, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel11_output<-rpartmodel11_output%>%mutate(diff =(rpartmodel11_output$PM2.5-rpartmodel11_output$model_output))
rpartmodel11_MSE<-mean((rpartmodel11_output$diff)^2, na.rm=TRUE)


print(paste("MSE of lm model trained with only NO for training data set", lmmodel1_MSE))
## [1] "MSE of lm model trained with only NO for training data set 1061.33499331602"
print(paste("MSE of rpart model trained with only NO for training data set", rpartmodel1_MSE))
## [1] "MSE of rpart model trained with only NO for training data set 987.1894333002"
print(paste("MSE of lm model trained with only NO + Season for training data set", lmmodel2_MSE))
## [1] "MSE of lm model trained with only NO + Season for training data set 918.333727467041"
print(paste("MSE of rpart model trained with only NO + Season for training data set", rpartmodel2_MSE))
## [1] "MSE of rpart model trained with only NO + Season for training data set 876.496881540318"
print(paste("MSE of lm model trained with only NO2 + Season for  training data set", lmmodel3_MSE))
## [1] "MSE of lm model trained with only NO2 + Season for  training data set 1340.46177198892"
print(paste("MSE of rpart model trained with only NO2 + Season for training data set", rpartmodel3_MSE))
## [1] "MSE of rpart model trained with only NO2 + Season for training data set 776.716479521588"
print(paste("MSE of lm model trained with only NOX + Season for  training data set", lmmodel4_MSE))
## [1] "MSE of lm model trained with only NOX + Season for  training data set 1412.41811010542"
print(paste("MSE of rpart model trained with only NOX + Season for training data set", rpartmodel4_MSE))
## [1] "MSE of rpart model trained with only NOX + Season for training data set 1249.88681402177"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Season for  training data set", lmmodel5_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Season for  training data set 802.808464115074"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for  training data set", lmmodel6_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for  training data set 631.673468162641"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for training data set", rpartmodel6_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for training data set 727.900589372769"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for  training data set", lmmodel7_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for  training data set 617.745498296348"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg +  Precipitation + Season for training data set", rpartmodel7_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg +  Precipitation + Season for training data set 727.900589372769"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for  training data set", lmmodel8_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for  training data set 720.697686496395"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for training data set", rpartmodel8_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for training data set 710.426155551455"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for  training data set", lmmodel9_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for  training data set 462.738773231392"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set", rpartmodel9_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set 717.825390944258"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for  training data set", lmmodel10_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for  training data set 106.871067357215"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set", rpartmodel10_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set 212.265984562307"
print(paste("MSE of lm model trained with only AQI for  training data set", lmmodel11_MSE))
## [1] "MSE of lm model trained with only AQI for  training data set 210.920482258775"
print(paste("MSE of rpart model trained with only AQI for training data set", rpartmodel11_MSE))
## [1] "MSE of rpart model trained with only AQI for training data set 212.265984562307"
lmmodel1_output <- data.frame(evaluate_model(lmmodel1_NO, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel1_output<-lmmodel1_output %>% mutate(diff =(lmmodel1_output$PM2.5 - lmmodel1_output$model_output))
lmmodel1_MSE<-mean((lmmodel1_output$diff)^2,na.rm=TRUE)

rpartmodel1_output <- evaluate_model(rpartmode11_NO, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel1_output<-rpartmodel1_output%>%mutate(diff =(rpartmodel1_output$PM2.5-rpartmodel1_output$model_output))
rpartmodel1_MSE<-mean((rpartmodel1_output$diff)^2,na.rm=TRUE)

lmmodel2_output <- data.frame(evaluate_model(lmmodel2_NOSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel2_output<-lmmodel2_output %>% mutate(diff =(lmmodel2_output$PM2.5 - lmmodel2_output$model_output))
lmmodel2_MSE<-mean((lmmodel2_output$diff)^2,na.rm=TRUE)

rpartmodel2_output <- evaluate_model(rpartmode12_NOSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel2_output<-rpartmodel2_output%>%mutate(diff =(rpartmodel2_output$PM2.5-rpartmodel2_output$model_output))
rpartmodel2_MSE<-mean((rpartmodel2_output$diff)^2,na.rm=TRUE)

lmmodel3_output <- data.frame(evaluate_model(lmmodel3_NO2Sea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel3_output<-lmmodel3_output %>% mutate(diff =(lmmodel3_output$PM2.5 - lmmodel3_output$model_output))
lmmodel3_MSE<-mean((lmmodel3_output$diff)^2,na.rm=TRUE)

rpartmodel3_output <- evaluate_model(rpartmode13_NO2Sea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel3_output<-rpartmodel3_output%>%mutate(diff =(rpartmodel3_output$PM2.5-rpartmodel3_output$model_output))
rpartmodel3_MSE<-mean((rpartmodel3_output$diff)^2,na.rm=TRUE)

lmmodel4_output <- data.frame(evaluate_model(lmmodel4_NOXSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel4_output<-lmmodel4_output %>% mutate(diff =(lmmodel4_output$PM2.5 - lmmodel4_output$model_output))
lmmodel4_MSE<-mean((lmmodel4_output$diff)^2,na.rm=TRUE)

rpartmodel4_output <- evaluate_model(rpartmode14_NOXSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel4_output<-rpartmodel4_output%>%mutate(diff =(rpartmodel4_output$PM2.5-rpartmodel4_output$model_output))
rpartmodel4_MSE<-mean((rpartmodel4_output$diff)^2,na.rm=TRUE)

lmmodel5_output <- data.frame(evaluate_model(lmmodel5_NOCombSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel5_output<-lmmodel5_output %>% mutate(diff =(lmmodel5_output$PM2.5 - lmmodel5_output$model_output))
lmmodel5_MSE<-mean((lmmodel5_output$diff)^2,na.rm=TRUE)


rpartmodel5_output <- evaluate_model(rpartmode15_NOCombSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel5_output<-rpartmodel5_output%>%mutate(diff =(rpartmodel5_output$PM2.5-rpartmodel5_output$model_output))
rpartmodel5_MSE<-mean((rpartmodel5_output$diff)^2,na.rm=TRUE)
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Season for test data set", rpartmodel5_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Season for test data set 770.051250699718"
lmmodel6_output <- data.frame(evaluate_model(lmmodel6_NOTempCombSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel6_output<-lmmodel6_output %>% mutate(diff =(lmmodel6_output$PM2.5 - lmmodel6_output$model_output))
lmmodel6_MSE<-mean((lmmodel6_output$diff)^2,na.rm=TRUE)

rpartmodel6_output <- evaluate_model(rpartmode16_NOTempCombSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel6_output<-rpartmodel6_output%>%mutate(diff =(rpartmodel6_output$PM2.5-rpartmodel6_output$model_output))
rpartmodel6_MSE<-mean((rpartmodel6_output$diff)^2,na.rm=TRUE)

lmmodel7_output <- data.frame(evaluate_model(lmmodel7_NOTempCombPrcPSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel7_output<-lmmodel7_output %>% mutate(diff =(lmmodel7_output$PM2.5 - lmmodel7_output$model_output))
lmmodel7_MSE<-mean((lmmodel7_output$diff)^2,na.rm=TRUE)

rpartmodel7_output <- evaluate_model(rpartmodel7_NOTempCombPrcPSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel7_output<-rpartmodel7_output%>%mutate(diff =(rpartmodel7_output$PM2.5-rpartmodel7_output$model_output))
rpartmodel7_MSE<-mean((rpartmodel7_output$diff)^2,na.rm=TRUE)

lmmodel8_output <- data.frame(evaluate_model(lmmodel8_NOCombElevPSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel8_output<-lmmodel8_output %>% mutate(diff =(lmmodel8_output$PM2.5 - lmmodel8_output$model_output))
lmmodel8_MSE<-mean((lmmodel8_output$diff)^2,na.rm=TRUE)

rpartmodel8_output <- evaluate_model(rpartmodel8_NOCombElevPSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel8_output<-rpartmodel8_output%>%mutate(diff =(rpartmodel8_output$PM2.5-rpartmodel8_output$model_output))
rpartmodel8_MSE<-mean((rpartmodel8_output$diff)^2,na.rm=TRUE)

lmmodel9_output <- data.frame(evaluate_model(lmmodel9_NOTempCombElevPrcpSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel9_output<-lmmodel9_output %>% mutate(diff =(lmmodel9_output$PM2.5 - lmmodel9_output$model_output))
lmmodel9_MSE<-mean((lmmodel9_output$diff)^2,na.rm=TRUE)

rpartmodel9_output <- evaluate_model(rpartmodel9_NOTempCombElevPrcpSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel9_output<-rpartmodel9_output%>%mutate(diff =(rpartmodel9_output$PM2.5-rpartmodel9_output$model_output))
rpartmodel9_MSE<-mean((rpartmodel9_output$diff)^2,na.rm=TRUE)


lmmodel10_output <- data.frame(evaluate_model(lmmodel10_NOTempCombElevPrcpSeaAQI, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel10_output<-lmmodel10_output %>% mutate(diff =(lmmodel10_output$PM2.5 - lmmodel10_output$model_output))
lmmodel10_MSE<-mean((lmmodel10_output$diff)^2, na.rm=TRUE)

rpartmodel10_output <- evaluate_model(rpartmodel10_NOTempCombElevPrcpSeaAQI, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel10_output<-rpartmodel10_output%>%mutate(diff =(rpartmodel10_output$PM2.5-rpartmodel10_output$model_output))
rpartmodel10_MSE<-mean((rpartmodel10_output$diff)^2, na.rm=TRUE)

lmmodel11_output <- data.frame(evaluate_model(lmmodel11_AQI, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel11_output<-lmmodel11_output %>% mutate(diff =(lmmodel11_output$PM2.5 - lmmodel11_output$model_output))
lmmodel11_MSE<-mean((lmmodel11_output$diff)^2, na.rm=TRUE)

rpartmodel11_output <- evaluate_model(rpartmodel11_AQI, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel11_output<-rpartmodel11_output%>%mutate(diff =(rpartmodel11_output$PM2.5-rpartmodel11_output$model_output))
rpartmodel11_MSE<-mean((rpartmodel11_output$diff)^2, na.rm=TRUE)

print(paste("MSE of lm model trained with only NO for test data set", lmmodel1_MSE))
## [1] "MSE of lm model trained with only NO for test data set 1027.64415372642"
print(paste("MSE of rpart model trained with only NO for test data set", rpartmodel1_MSE))
## [1] "MSE of rpart model trained with only NO for test data set 958.547982922875"
print(paste("MSE of lm model trained with only NO + Season for test data set", lmmodel2_MSE))
## [1] "MSE of lm model trained with only NO + Season for test data set 896.715308139762"
print(paste("MSE of rpart model trained with only NO + Season for test data set", rpartmodel2_MSE))
## [1] "MSE of rpart model trained with only NO + Season for test data set 985.077280399531"
print(paste("MSE of lm model trained with only NO2 + Season for  test data set", lmmodel3_MSE))
## [1] "MSE of lm model trained with only NO2 + Season for  test data set 1187.39887983887"
print(paste("MSE of rpart model trained with only NO2 + Season for test data set", rpartmodel3_MSE))
## [1] "MSE of rpart model trained with only NO2 + Season for test data set 819.395413552255"
print(paste("MSE of lm model trained with only NOX + Season for  test data set", lmmodel4_MSE))
## [1] "MSE of lm model trained with only NOX + Season for  test data set 1778.98277381702"
print(paste("MSE of rpart model trained with only NOX + Season for test data set", rpartmodel4_MSE))
## [1] "MSE of rpart model trained with only NOX + Season for test data set 1877.07924764811"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Season for  test data set", lmmodel5_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Season for  test data set 1102.21594202083"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for  test data set", lmmodel6_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for  test data set 688.910782291019"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for test data set", rpartmodel6_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for test data set 776.144394669279"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for  test data set", lmmodel7_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for  test data set 671.858921328259"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg +  Precipitation + Season for test data set", rpartmodel7_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg +  Precipitation + Season for test data set 776.144394669279"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for  test data set", lmmodel8_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for  test data set 1129.76725458874"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for test data set", rpartmodel8_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for test data set 770.051250699718"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for  test data set", lmmodel9_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for  test data set 508.743362701904"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set", rpartmodel9_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set 765.364054382867"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for  test data set", lmmodel10_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for  test data set 188.852974998694"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set", rpartmodel10_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set 247.40355272891"
print(paste("MSE of lm model trained with only AQI for  test data set", lmmodel11_MSE))
## [1] "MSE of lm model trained with only AQI for  test data set 205.748333922921"
print(paste("MSE of rpart model trained with only AQI for test data set", rpartmodel11_MSE))
## [1] "MSE of rpart model trained with only AQI for test data set 247.40355272891"

Outcomes

Outcomes for the objectives and analysis performed.

##1. Benford’s law Conformity Check: Benford’s law conformity check function was implemented with basic functions. Comparing the actual probability of each leading digits with Benford’s probability. Due to incompleteness of the dataset available, most of the data set didnt confirm to Benfordness. station_day dataset had few of the pollutants data with close or acceptable confirmity.

##2. AQI of Bengaluru, Chennai, Delhi and Mumbai for the period 2015-2020: With city_day dataset, averaged over months and plotted AQI for over different months over the years. Using graph visualization, drawn inferences.

##3. Analysis of different pollutants for major cities over different years: With city_day dataset, plotted pollutants for different years. With graphs, derived the inferences.

##4. Analysis of different pollutants across different Indian seasons over the day during different hours: With station_day cleaned and benfordness checked dataset, plotted graphs for chennai and Bengaluru over different hours for the day and derived inferences for pollutants with different graphs

##5. Understanding the Impact of different pollutants, temperature, season, precipitation on PM2.5: Created a combined data set with cleaned station_day, wether and geodata plotted different graphs to study the relation or impact other variables over PM2.5.Able to derive clear inferences.

##6. Building different prediction model for PM2.5 using key pollutants, weather info and season for major cities using lm and rpart modeling: Built 22 different models using lm and rpart with different set explanatory variables set (1. NO, 2. NO + Season,3.NO2 + Season, 4.NOx + Season, 5.NO+ NO2 + NOx + Season, 6.NO+ NO2 + NOx + Season + tavg, 7.NO+ NO2 + NOx + Season + tavg + prcp,8. NO+ NO2 + NOx + Season + Elevation, 9.NO+ NO2 + NOx + Season + tavg + prcp + Elevation, 10.NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI, 11.AQI). The model is evaluated using both training data set and test set. The test data was prepared by combining cleaned city_day, wethear and geo info of major cities. MSE error for each of the model prediction calculated and stored for comparison. fmodel and prp graphs were plotted for study.

##7. Evaluation and comparison of performance of different models and selecting a optimal one: Comparing all the model results and MSE of each of the 22 model trained using LM and RPART, the model trained using explanatory variables NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQ yield better results. The below graph show actual vs prediction for both lm model and rpart model. lm model had MSE of 188.852974998694 for test data set and 106.871067357215 for training dataset.

#lmmodel10_output
ggplot(lmmodel10_output, aes(x=model_output, y= PM2.5)) +
  geom_point(na.rm=TRUE) +
  geom_abline(intercept=0, slope=1) +
  labs(x='Predicted Values', y='Actual Values', title='LMPredictedvsActual\nNO+NO2+NOx+Tempavg+Precipitation+Elevation+Season+AQI')

ggplot(rpartmodel10_output, aes(x=model_output, y= PM2.5)) +
  geom_point(na.rm=TRUE) +
  geom_abline(intercept=0, slope=1) +
  labs(x='Predicted Values', y='Actual Values', title='RPARTPredictedvsActual\nNO+NO2+NOx+Tempavg+Precipitation+Elevation+Season+AQI')

Results and Discussions

Results and inferences

##1. Benford’s law Conformity Check: PM2.5, AQI and few other pollutants data from city_day and station_day was checked for Benfordness.PM2.5 from station_day dataset had acceptable conformity and NO from station_day had close conformity. Rest of the data failed the Benfordness check as there were lot of missing info. This could have impacted the modeling. Hence used station_day dataset for training as it was better than city_day data. Used city_day dataset as test set.

##2. AQI of Bengaluru, Chennai, Delhi and Mumbai for the period 2015-2020: AQI over different months for major cities for the years(2015-2020) were highly varying. Delhi had very high AQI compared to other cities. Chennai and Bangalore almost similar AQI. The AQI is high during start of the year and really high during end of the year. Across different cities the common observation over the given years is that the AQI has improved for the respective city AQI benchmarks compare to past. The AQI is been varying across different months. This clearly indicated that seasonal changes and weather has impact on the AQI.

##3. Analysis of different pollutants for major cities over different years: Delhi has high AQI and higher concentration of pollutants for the given period. Bangalore has low concentration of PM2.5 and SO2. Mumbai has higher concentraion of NO, NO2 and NOx than Chennai and Bangalore. Chennai PM2.5 and other pollutant concentraion is neither high or less compared to other cities. CO concentration has reduced in all cities over the years. Mumbai has lowest O3 concentration. PM2.5 data Mumbai doesnt appear to be reliable as is unusually low.

##4. Analysis of different pollutants across different Indian seasons over the day during different hours: Chennai has low PM2.5 concentration during summer whereas Bengaluru has low concentration during monsoon. PM2.5 is high during morning hours till afternoon. Chennai has high O3 concentration during monsoon and summer whereas Bengaluru has during Spring and prewinter. O3 is high through the day for chennai but for Bengaluru peaks durign mid night and afternoon. For both Chennai and Bengaluru, CO is very high during monsoon and high during winter and it peaks first half of the day and drops during second of the day.NO is high during early morning and low during afternoons for Chennai and Bengaluru it peaks during early morning and late night. Bengaluru has high NO during winter and prewinter.SO2 varies through out the day, high during winter and low during summer for Chennai and for both Chennai and Bengaluru.SO2 varies through out the day for Chennai and Bengaluru. AQI is high during monsoon and low during summer and peaks during afternoon till midnight for Chennai. For Bengaluru, AQI drops during early morning and peaks in the afternoon, high during spring and low during monsoon.

##5. Understanding the Impact of different pollutants, temperature, season, precipitation on PM2.5: PM2.5 is higher in the order (Highest, Higher, High) seasons of Autumn, Prewinter and Winter. During Monsoons, PM2.5 is generally lowest with low NO,AQI, NOx, and NO2. Higher temperature, precipitation and elevation has associated low PM2.5. ##6. Building different prediction model for PM2.5 using key pollutants, weather info and season for major cities using lm and rpart modeling: Built 22 different models using lm and rpart with different set explanatory variables set (1. NO, 2. NO + Season,3.NO2 + Season, 4.NOx + Season, 5.NO+ NO2 + NOx + Season, 6.NO+ NO2 + NOx + Season + tavg, 7.NO+ NO2 + NOx + Season + tavg + prcp,8. NO+ NO2 + NOx + Season + Elevation, 9.NO+ NO2 + NOx + Season + tavg + prcp + Elevation, 10.NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI, 11.AQI). The model is evaluated using both training data set and test set. The test data was prepared by combining cleaned city_day, wethear and geo info of major cities. MSE error for each of the model prediction calculated and stored for comparison. lm models were performing better than the rpart model.

##7. Evaluation and comparison of performance of different models and selecting a optimal one: Comparing all the model results and MSE of each of the 22 model trained using LM and RPART, the model trained using explanatory variables NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQ yield better results. lm model had MSE of 188.852974998694 for test data set and 106.871067357215 for training dataset.

##8.Future Work: The following objectives were not taken due to lack of suitable dataset and these can be considered for future work 1. Not all major pollutants are correlated with PM2.5, further analysis can be performed to build further optimal and cities related parameters like lat, lan, populatio, mobility, solid waste etc. 2. This can analysis can be further extended to see if there could be impact of green cover or tree census growth over the years on PM2.5 as trees play an important role in purifying air. 3. Study can be done to predict cardio vascular incidence in a give city with give pollutants details.